22 commits

Author SHA1 Message Date
Viktor Barzin
afb8a16623 [infra] Scale down unused services + remove DoH ingress
Scale to 0 replicas:
- ollama: low usage; saves ~2Gi memory and leaves 59GB of NFS-SSD model data idle
- poison-fountain: RSS link archiver, not actively used
- travel-blog: Hugo blog, not actively used

Remove technitium DoH ingress (dns.viktorbarzin.me): externally unreachable
and unused. DNS is served on UDP/TCP port 53 via LoadBalancer (10.0.20.201).

Clears 3 of 5 ExternalAccessDivergence services. Remaining 2 (pdf, travel)
should clear now that the Uptime Kuma monitors will report both down.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 18:55:52 +00:00
Viktor Barzin
996bdfc9b6 [technitium] Uninstall MySQL+SQLite query log plugins instead of just disabling
## Context
Disabling MySQL/SQLite query logging via config was not durable — Technitium
re-enables disabled plugins on pod restart, causing 46 GB/day of writes to
the standalone MySQL (15M inserts to technitium.dns_logs between CronJob runs).

## This change:
The password-sync CronJob now UNINSTALLS MySQL and SQLite query log plugins
via `/api/apps/uninstall` instead of setting `enableLogging:false`. This is
permanent — the plugin files are removed from the PVC, so they can't re-enable
on restart. The CronJob checks if the plugins are present first (idempotent).

Only PostgreSQL query logging remains (90-day retention).
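
The presence check that makes the uninstall idempotent can be sketched as follows. This is a minimal illustration, not the CronJob's actual script: the plugin names are placeholders, and `list_installed` / `uninstall` are hypothetical callables that would wrap the Technitium API (the latter wrapping `/api/apps/uninstall`).

```python
from typing import Callable, Iterable, List

# Placeholder names for illustration; the real app names may differ.
UNWANTED_PLUGINS = ("Query Logs (MySQL)", "Query Logs (Sqlite)")


def plugins_to_uninstall(
    installed: Iterable[str],
    unwanted: Iterable[str] = UNWANTED_PLUGINS,
) -> List[str]:
    """Return the unwanted plugins that are actually installed."""
    installed_set = set(installed)
    return [name for name in unwanted if name in installed_set]


def sync_plugins(
    list_installed: Callable[[], Iterable[str]],
    uninstall: Callable[[str], None],
) -> List[str]:
    """Uninstall unwanted plugins only when present.

    Calling this repeatedly is safe: once a plugin is gone, later runs
    find nothing to remove and make no uninstall calls.
    """
    removed = plugins_to_uninstall(list_installed())
    for name in removed:
        uninstall(name)  # hypothetical wrapper around /api/apps/uninstall
    return removed
```

On the first run the function removes whatever is installed; on every later run it returns an empty list without calling `uninstall`.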

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 08:20:55 +00:00
Viktor Barzin
e80b2f026f [infra] Migrate Terraform state from local SOPS to PostgreSQL backend
Two-tier state architecture:
- Tier 0 (infra, platform, cnpg, vault, dbaas, external-secrets): local
  state with SOPS encryption in git — unchanged, required for bootstrap.
- Tier 1 (105 app stacks): PostgreSQL backend on CNPG cluster at
  10.0.20.200:5432/terraform_state with native pg_advisory_lock.
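
The tier split amounts to a membership test against the Tier 0 list. A minimal sketch of that decision, assuming the six stack names listed above are the complete Tier 0 set; the function names are hypothetical and do not come from `scripts/tg`:

```python
# Tier 0 stacks as listed in this commit message.
TIER0_STACKS = frozenset(
    {"infra", "platform", "cnpg", "vault", "dbaas", "external-secrets"}
)


def backend_for(stack: str) -> str:
    """Return "local" for Tier 0 stacks and "pg" for everything else."""
    return "local" if stack in TIER0_STACKS else "pg"


def needs_sops(stack: str) -> bool:
    """Only local (Tier 0) state is SOPS-encrypted in git."""
    return backend_for(stack) == "local"
```

For example, `backend_for("vault")` gives `"local"`, while any app stack such as `backend_for("ollama")` gives `"pg"` and skips the SOPS toolchain.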

Motivation: multi-operator friction (every workstation needed SOPS + age +
git-crypt), bootstrap complexity for new operators, and headless agents/CI
needing the full encryption toolchain just to read state.

Changes:
- terragrunt.hcl: conditional backend (local vs pg) based on tier0 list
- scripts/tg: tier detection, auto-fetch PG creds from Vault for Tier 1,
  skip SOPS and Vault KV locking for Tier 1 stacks
- scripts/state-sync: tier-aware encrypt/decrypt (skips Tier 1)
- scripts/migrate-state-to-pg: one-shot migration script (idempotent)
- stacks/vault/main.tf: pg-terraform-state static role + K8s auth role
  for claude-agent namespace
- stacks/dbaas: terraform_state DB creation + MetalLB LoadBalancer
  service on shared IP 10.0.20.200
- Deleted 107 .tfstate.enc files for migrated Tier 1 stacks
- Cleaned up per-stack tiers.tf (now generated by root terragrunt.hcl)

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 19:33:12 +00:00
Viktor Barzin
f538115c43 [dbaas] Migrate MySQL from InnoDB Cluster to standalone StatefulSet
## Context
Disk write analysis showed MySQL InnoDB Cluster writing ~95 GB/day for only
~35 MB of actual data due to Group Replication overhead (binlog, relay log,
GR apply log). The operator enforces GR even with serverInstances=1.

Bitnami Helm charts were deprecated by Broadcom in Aug 2025 — no free
container images available. Using official mysql:8.4 image instead.

## This change:
- Replace helm_release.mysql_cluster service selector with raw
  kubernetes_stateful_set_v1 using official mysql:8.4 image
- ConfigMap mysql-standalone-cnf: skip-log-bin, innodb_flush_log_at_trx_commit=2,
  innodb_doublewrite=ON (re-enabled for standalone safety)
- Service selector switched to standalone pod labels
- Technitium: disable SQLite query logging (18 GB/day write amplification),
  keep PostgreSQL-only logging (90-day retention)
- Grafana datasource and dashboards migrated from MySQL to PostgreSQL
- Dashboard SQL queries fixed for PG integer division (::float cast)
- Updated CLAUDE.md service-specific notes

## What is NOT in this change:
- InnoDB Cluster + operator removal (Phase 4, 7+ days from now)
- Stale Vault role cleanup (Phase 4)
- Old PVC deletion (Phase 4)

Expected write reduction: ~113 GB/day (MySQL 95 + Technitium 18)
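
The figures quoted in this message can be checked with a few lines of arithmetic. The sketch below uses only the numbers stated above and assumes decimal units (1 GB = 1000 MB); the function name is illustrative.

```python
MYSQL_WRITES_GB_PER_DAY = 95       # InnoDB Cluster writes, from the context above
TECHNITIUM_WRITES_GB_PER_DAY = 18  # SQLite query logging writes
MYSQL_DATA_MB = 35                 # actual data held in MySQL


def writes_per_dataset_size(writes_gb_per_day: float, data_mb: float) -> float:
    """Daily bytes written divided by dataset size (decimal units)."""
    return writes_gb_per_day * 1000 / data_mb


total_reduction = MYSQL_WRITES_GB_PER_DAY + TECHNITIUM_WRITES_GB_PER_DAY
ratio = writes_per_dataset_size(MYSQL_WRITES_GB_PER_DAY, MYSQL_DATA_MB)
print(f"expected reduction: ~{total_reduction} GB/day")
print(f"MySQL wrote roughly {ratio:.0f}x its dataset size per day")
```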

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 19:01:06 +00:00
Viktor Barzin
b1d152be1f [infra] Auto-create Cloudflare DNS records from ingress_factory
## Context

Deploying new services required manually adding hostnames to
cloudflare_proxied_names/cloudflare_non_proxied_names in config.tfvars —
a separate file from the service stack. This was frequently forgotten,
leaving services unreachable externally.

## This change:

- Add `dns_type` parameter to `ingress_factory` and `reverse_proxy/factory`
  modules. Setting `dns_type = "proxied"` or `"non-proxied"` auto-creates
  the Cloudflare DNS record (CNAME to tunnel or A/AAAA to public IP).
- Simplify cloudflared tunnel from 100 per-hostname rules to wildcard
  `*.viktorbarzin.me → Traefik`. Traefik still handles host-based routing.
- Add global Cloudflare provider via terragrunt.hcl (separate
  cloudflare_provider.tf with Vault-sourced API key).
- Migrate 118 hostnames from centralized config.tfvars to per-service
  dns_type. 17 hostnames remain centrally managed (Helm ingresses,
  special cases).
- Update docs, AGENTS.md, CLAUDE.md, dns.md runbook.

```
BEFORE                          AFTER
config.tfvars (manual list)     stacks/<svc>/main.tf
        |                         module "ingress" {
        v                           dns_type = "proxied"
stacks/cloudflared/               }
  for_each = list                     |
  cloudflare_record               auto-creates
  tunnel per-hostname             cloudflare_record + annotation
```
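
The `dns_type` behaviour described above can be modelled as a small mapping from the parameter value to the records that get created. This is a sketch in Python rather than the module's HCL; the `DnsRecord` type, the `records_for` function, and the tunnel and IP values in the usage note are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DnsRecord:
    name: str
    type: str
    target: str
    proxied: bool


def records_for(
    hostname: str,
    dns_type: Optional[str],
    tunnel_target: str,
    public_ipv4: str,
    public_ipv6: Optional[str] = None,
) -> List[DnsRecord]:
    """Map a dns_type value to the DNS records it implies."""
    if dns_type is None:
        return []  # no record requested for this ingress
    if dns_type == "proxied":
        # Proxied hostnames point at the tunnel via CNAME.
        return [DnsRecord(hostname, "CNAME", tunnel_target, True)]
    if dns_type == "non-proxied":
        # Non-proxied hostnames resolve straight to the public IP(s).
        records = [DnsRecord(hostname, "A", public_ipv4, False)]
        if public_ipv6:
            records.append(DnsRecord(hostname, "AAAA", public_ipv6, False))
        return records
    raise ValueError(f"unknown dns_type: {dns_type!r}")
```

Called as `records_for("app.example.com", "proxied", "tunnel.example.net", "203.0.113.10")`, this yields a single proxied CNAME; with `"non-proxied"` it yields an A record and, if an IPv6 address is supplied, an AAAA record.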

## What is NOT in this change:

- Uptime Kuma monitor migration (still reads from config.tfvars)
- 17 remaining centrally-managed hostnames (Helm, special cases)
- Removal of allow_overwrite (keep until migration confirmed stable)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 13:45:04 +00:00
Viktor Barzin
9baefa22ab fix: technitium CronJob scheduling, LUKS backup support, speedtest scrape
- technitium-password-sync: remove RWO encrypted PVC mount that caused
  pods to stick in ContainerCreating on wrong nodes. Plugin install now
  warns instead of failing when zip unavailable.
- daily-backup: add LUKS decryption support for encrypted PVC snapshots
  using /root/.luks-backup-key. Uses noload mount option to skip ext4
  journal replay. Also installed cryptsetup-bin on PVE host.
- speedtest: disable prometheus.io/scrape annotation (no /prometheus
  endpoint exists, causing ScrapeTargetDown alert).

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 15:12:32 +00:00
Viktor Barzin
803cb5fd26 fix: convert Technitium zone sync from one-time Job to CronJob
Secondary/tertiary DNS instances had no custom zones — only the
primary had viktorbarzin.lan and viktorbarzin.me. The old setup Job
ran once at deployment and never synced new zones.

New CronJob runs every 30 minutes:
- Gets all zones from primary
- Enables zone transfer on primary
- Creates missing zones as Secondary type on replicas
- Resyncs existing zones via AXFR
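
The per-replica decision in the steps above reduces to a set comparison. A minimal sketch, assuming zone names are plain strings; `plan_zone_sync` is a hypothetical helper and makes no API calls itself:

```python
from typing import Dict, Iterable, List


def plan_zone_sync(
    primary_zones: Iterable[str],
    replica_zones: Iterable[str],
) -> Dict[str, List[str]]:
    """Decide what one replica needs on a given CronJob run.

    Zones on the primary but missing from the replica are created as
    Secondary zones; zones present on both are resynced. Zones that
    exist only on the replica are left alone.
    """
    primary = set(primary_zones)
    replica = set(replica_zones)
    return {
        "create_secondary": sorted(primary - replica),
        "resync": sorted(primary & replica),
    }
```

On a replica with no custom zones, both `viktorbarzin.lan` and `viktorbarzin.me` land in `create_secondary`; on the next run they move to `resync`.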

Fixes .lan resolution failures (2/3 queries returned NXDOMAIN).

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 12:18:19 +00:00
Viktor Barzin
68c8c5b4a0 fix(technitium): migrate primary to proxmox-lvm-encrypted + post-mortem
SEV1 outage: fsid=0 in PVE /etc/exports broke all NFS subdirectory
mounts from k8s (NFSv4 pseudo-root path resolution). Combined with a
lockd failure, this left both NFSv4 and NFSv3 mount paths broken. The
outage cascaded into DNS primary, Vault (2/3 pods), Alertmanager, and
20+ services.

Changes:
- Primary PVC: NFS (nfs-truenas) → proxmox-lvm-encrypted
- Secondary/tertiary PVCs: proxmox-lvm → proxmox-lvm-encrypted
- Removed NFS module dependency from technitium stack
- Added full post-mortem with prevention plan

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 08:18:59 +00:00
Viktor Barzin
82b0f6c4cb truenas deprecation: migrate all non-immich storage to proxmox NFS
- Migrate 7 backup CronJobs to Proxmox host NFS (192.168.1.127)
  (etcd, mysql, postgresql, nextcloud, redis, vaultwarden, plotting-book)
- Migrate headscale backup, ebook2audiobook, osm_routing to Proxmox NFS
- Migrate servarr (lidarr, readarr, soulseek) NFS refs to Proxmox
- Remove 79 orphaned TrueNAS NFS module declarations from 49 stacks
- Delete stacks/platform/modules/ (27 dead module copies, 65MB)
- Update nfs-truenas StorageClass to point to Proxmox (192.168.1.127)
- Remove iscsi DNS record from config.tfvars
- Fix woodpecker persistence config and alertmanager PV

Only Immich (8 PVCs, ~1.4TB) remains on TrueNAS.
2026-04-12 14:35:39 +01:00
Viktor Barzin
6101fb99f9 Reduce disk write amplification across cluster (~200-350 GB/day savings) [ci skip]
- Prometheus: persist metric whitelist (keep rules) to Helm template, preventing
  regression from 33K to 250K samples/scrape on next apply. Reduce retention 52w→26w.
- MySQL InnoDB: aggressive write reduction — flush_log_at_trx_commit=0, sync_binlog=0,
  doublewrite=OFF, io_capacity=100/200, redo_log=1GB, flush_neighbors=1, reduced page cleaners.
- etcd: increase snapshot-count 10000→50000 to reduce WAL snapshot frequency.
- VM disks: enable TRIM/discard passthrough to LVM thin pool via create-vm module.
- Cloud-init: enable fstrim.timer, journald limits (500M/7d/compress).
- Kubelet: containerLogMaxSize=10Mi, containerLogMaxFiles=3.
- Technitium: DNS query log retention 0 (unlimited) → 30 days, capping previously unbounded writes to MySQL.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 19:01:21 +00:00
Viktor Barzin
9de543c076 feat(technitium): add Split Horizon AddressTranslation for hairpin NAT fix
192.168.1.x LAN clients couldn't reach non-proxied *.viktorbarzin.me
domains because the TP-Link router doesn't support hairpin NAT.

Adds a CronJob that configures Technitium's Split Horizon
AddressTranslation post-processor on all 3 instances to translate
176.12.22.76 (public IP) → 10.0.20.200 (Traefik LB) in DNS responses
for 192.168.1.0/24 clients. Also adds viktorbarzin.me to the DNS
Rebinding Protection privateDomains allowlist so the translated private
IP isn't stripped.
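
The translation rule itself is small. The sketch below models it with Python's `ipaddress` module, using the addresses from this message; it illustrates the logic only and is not how the Technitium post-processor is implemented or configured.

```python
import ipaddress

# Values taken from the commit message above.
LAN = ipaddress.ip_network("192.168.1.0/24")
TRANSLATIONS = {
    ipaddress.ip_address("176.12.22.76"): ipaddress.ip_address("10.0.20.200"),
}


def translate_answer(client_ip: str, answer_ip: str) -> str:
    """Rewrite the public IP to the Traefik LB IP for LAN clients only."""
    client = ipaddress.ip_address(client_ip)
    answer = ipaddress.ip_address(answer_ip)
    if client in LAN:
        # Unlisted answers pass through unchanged.
        return str(TRANSLATIONS.get(answer, answer))
    return str(answer)
```

A client at 192.168.1.50 asking for a non-proxied hostname receives 10.0.20.200, while clients outside 192.168.1.0/24 still receive 176.12.22.76.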

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 18:43:15 +00:00
Viktor Barzin
09b4bad958 feat: pin ~28 images to specific versions, enable DIUN monitoring, add app-stacks pipeline
Pin third-party images from :latest to current stable versions:
- Platform: cloudflared, technitium, snmp-exporter, pve-exporter,
  headscale, shadowsocks, xray
- Apps: paperless-ngx, linkwarden, wealthfolio, speedtest, synapse,
  n8n, prowlarr, qbittorrent, lidarr, rybbit, ollama, immichframe,
  cyberchef, networking-toolbox, echo, coturn, shlink, affine

Enable DIUN annotations on all pinned deployments with per-image
tag patterns. Add Woodpecker app-stacks pipeline for selective
terragrunt apply on changed app stacks.
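
Selecting which stacks to apply can be derived from the list of changed file paths. A minimal sketch, assuming a `stacks/<name>/...` layout; `changed_stacks` is a hypothetical helper, not the pipeline's actual logic:

```python
from typing import Iterable, List


def changed_stacks(changed_files: Iterable[str], root: str = "stacks") -> List[str]:
    """Return the sorted, de-duplicated stack names touched by a change set."""
    stacks = set()
    for path in changed_files:
        parts = path.split("/")
        # Require root/<stack>/<file>, so files directly under root are ignored.
        if len(parts) >= 3 and parts[0] == root:
            stacks.add(parts[1])
    return sorted(stacks)
```

Given changes to `stacks/ollama/main.tf`, `stacks/n8n/main.tf` and `docs/dns.md`, only `n8n` and `ollama` would be selected for apply.
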
2026-04-06 14:27:13 +03:00
Viktor Barzin
9492874c43 fix: restore technitium MySQL query logging with Vault auto-rotation [ci skip]
Query logs stopped syncing on 2026-03-16 due to password mismatch after
MySQL cluster rebuild and Technitium app config reset.

- Add Vault static role mysql-technitium (7-day rotation)
- Add ExternalSecret for technitium-db-creds in technitium namespace
- Add password-sync CronJob (6h) to push rotated password to Technitium API
- Update Grafana datasource to use ESO-managed password
- Remove stale technitium_db_password variable (replaced by ESO)
- Update databases.md and restore-mysql.md runbook
2026-04-06 13:00:49 +03:00
Viktor Barzin
aac02f0467 meshcentral: restore DB from backup; technitium: remove orphaned PVC
- meshcentral: fix homepage annotations formatting (no functional change,
  serversscheme was tested but not needed since MeshCentral serves HTTP)
- meshcentral: restored user DB from Dec 2024 backup (1428B → 45KB)
- technitium: remove unused technitium-config-proxmox PVC (WaitForFirstConsumer,
  never mounted — primary uses NFS, replicas have their own proxmox PVCs)
2026-04-06 12:17:08 +03:00
Viktor Barzin
b0178cf6d2 technitium: add tertiary DNS replica and fix CoreDNS forward order
- Add tertiary DNS deployment with zone-transfer replication for
  externalTrafficPolicy=Local coverage across more nodes
- Reorder CoreDNS default forwarders: pfSense (10.0.20.1) first,
  then public DNS fallbacks (8.8.8.8, 1.1.1.1)
2026-04-06 11:57:31 +03:00
Viktor Barzin
f80e1fa868 cluster health fixes: NFS CSI, Immich ML, dbaas, Redis, DNS, trading-bot removal
- NFS CSI: fix liveness-probe port conflict (29652 → 29653)
- Immich ML: add gpu-workload priority class to enable preemption on node1
- dbaas: right-size MySQL memory limits (sidecar 6Gi→350Mi, main 4Gi→3Gi)
- Redis: add redis-master service via HAProxy for master-only routing,
  update config.tfvars redis_host to use it
- CoreDNS: forward .viktorbarzin.lan to Technitium ClusterIP (10.96.0.53)
  instead of stale LoadBalancer IP (10.0.20.200)
- Trading bot: comment out all resources (no longer needed)
- Vault: remove trading-bot PostgreSQL database role
2026-04-06 11:54:45 +03:00
Viktor Barzin
aa7a7e74b2 fix: technitium secondary to proxmox-lvm + bootstrap TF state
- Migrate technitium-secondary-config from NFS to proxmox-lvm PVC
- Change secondary strategy from RollingUpdate to Recreate (RWO)
- Bootstrap encrypted state for insta2spotify and ebooks stacks
- Import servarr sub-module PVCs and reconcile state
2026-04-05 19:32:40 +03:00
Viktor Barzin
cb8a808700 feat(storage): migrate 38 NFS PVCs to proxmox-lvm (Wave 2)
Add proxmox-lvm PVCs with pvc-autoresizer annotations for all
remaining single-pod app data services. Deployments updated to
use new block storage PVCs. Old NFS modules retained for rollback.

Services: affine, changedetection, diun, excalidraw, f1-stream,
hackmd, isponsorblocktv, matrix, n8n, send, grampsweb, health,
onlyoffice, owntracks, paperless-ngx, privatebin, resume,
speedtest, stirling-pdf, tandoor, rybbit (clickhouse), tor-proxy
(torrserver), whisper+piper, frigate (config), ollama (ui),
servarr (prowlarr/listenarr/qbittorrent), aiostreams, freshrss
(extensions), meshcentral (data+files), openclaw (data+home+
openlobster), technitium, mailserver (data+roundcube html+enigma),
dbaas (pgadmin).

Strategy set to Recreate where needed for RWO volumes.
2026-04-04 19:25:12 +03:00
Viktor Barzin
4b3851829b feat: organize Grafana dashboards into folders
Enable sidecar folderAnnotation + foldersFromFilesStructure to group
26 dashboards into 5 managed folders:

- Cluster (6): k8s health, API server, nodes, pods, kube-state-metrics
- Networking (6): CoreDNS, Technitium, Headscale, ingress, network traffic
- Hardware (5): node-exporter, proxmox, iDRAC, UPS, NVIDIA GPU
- Operations (4): backup health, registry, audit logs, Loki
- Applications (2): realestate-crawler, qBittorrent

Dashboard-to-folder mapping defined in grafana.tf locals block.
External stacks (headscale, technitium) annotated individually.
2026-03-28 16:23:49 +02:00
Viktor Barzin
c49e4561a3 consolidate MetalLB IPs: 5 → 1 (10.0.20.200)
Migrate all 11 LoadBalancer services to share 10.0.20.200:
- Update annotations: metallb.universe.tf → metallb.io
- Pin all services to 10.0.20.200 with allow-shared-ip: shared
- Standardize externalTrafficPolicy to Cluster (required for IP sharing)
- Remove redundant port 80 (roundcube) from mailserver LB
- Update CoreDNS forward: 10.0.20.204 → 10.0.20.200
- Update cloudflared tunnel target: 10.0.20.202 → 10.0.20.200

Services consolidated: coturn, headscale, kms, qbittorrent, shadowsocks,
torrserver, wireguard, mailserver, traefik, xray, technitium
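
Whether a set of services can share one IP can be checked up front. The sketch below encodes a simplified version of the MetalLB sharing rules (same sharing key, `Cluster` traffic policy, no overlapping protocol/port pairs) and ignores the alternative where services select the exact same pods. The `LbService` type and the port numbers in the usage note are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, List, Tuple

Port = Tuple[str, int]  # (protocol, port number)


@dataclass(frozen=True)
class LbService:
    name: str
    sharing_key: str
    traffic_policy: str
    ports: FrozenSet[Port]


def sharing_problems(services: Iterable[LbService]) -> List[str]:
    """Return reasons the services cannot share one IP (empty if none)."""
    services = list(services)
    problems: List[str] = []
    keys = sorted({s.sharing_key for s in services})
    if len(keys) > 1:
        problems.append(f"sharing keys differ: {', '.join(keys)}")
    for s in services:
        if s.traffic_policy != "Cluster":
            problems.append(f"{s.name} uses externalTrafficPolicy={s.traffic_policy}")
    owner: Dict[Port, str] = {}
    for s in services:
        for proto, number in sorted(s.ports):
            if (proto, number) in owner:
                problems.append(
                    f"{s.name} and {owner[(proto, number)]} both use {proto}/{number}"
                )
            else:
                owner[(proto, number)] = s.name
    return problems
```

With this check, a mailserver service that still exposes TCP/80 alongside Traefik's TCP/80 would be reported as a conflict, which is the kind of overlap that removing the redundant port avoids.
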
2026-03-24 18:35:43 +02:00
Viktor Barzin
33037eba46 upgrade MetalLB v0.10.2 → v0.15.3 and update annotations
- Replace custom ViktorBarzin/metallb module with official Helm chart
- Migrate from ConfigMap-based config to CRD (IPAddressPool + L2Advertisement)
- Update Traefik LB annotations from metallb.universe.tf to metallb.io format
- Technitium DNS keeps stable IP 10.0.20.204 via MetalLB auto-assignment
- Headscale split DNS already configured to use 10.0.20.204
2026-03-24 17:24:05 +02:00
Viktor Barzin
73511b1230 extract remaining 19 modules from platform, complete stack split [ci skip]
Phase 3: all 27 platform modules now run as independent stacks.
Platform reduced to empty shell (outputs only) for backward compat
with 72 app stacks that declare dependency "platform".
Fixed technitium cross-module dashboard reference by copying file.
Woodpecker pipeline applies all 27+1 stacks in parallel via loop.
All applied with zero destroys.
2026-03-17 21:42:16 +00:00