infra

Author	SHA1	Message	Date
Viktor Barzin	b034c868db	[traefik] Remove broken rewrite-body plugin and all rybbit/anti-AI injection The rewrite-body Traefik plugin (both packruler/rewrite-body v1.2.0 and the-ccsn/traefik-plugin-rewritebody v0.1.3) silently fails on Traefik v3.6.12 due to Yaegi interpreter issues with ResponseWriter wrapping. Both plugins load without errors but never inject content. Removed: - rewrite-body plugin download (init container) and registration - strip-accept-encoding middleware (only existed for rewrite-body bug) - anti-ai-trap-links middleware (used rewrite-body for injection) - rybbit_site_id variable from ingress_factory and reverse_proxy factory - rybbit_site_id from 25 service stacks (39 instances) - Per-service rybbit-analytics middleware CRD resources Kept: - compress middleware (entrypoint-level, working correctly) - ai-bot-block middleware (ForwardAuth to bot-block-proxy) - anti-ai-headers middleware (X-Robots-Tag: noai, noimageai) - All CrowdSec, Authentik, rate-limit middleware unchanged Next: Cloudflare Workers with HTMLRewriter for edge-side injection. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-17 12:41:17 +00:00
Viktor Barzin	e80b2f026f	[infra] Migrate Terraform state from local SOPS to PostgreSQL backend Two-tier state architecture: - Tier 0 (infra, platform, cnpg, vault, dbaas, external-secrets): local state with SOPS encryption in git — unchanged, required for bootstrap. - Tier 1 (105 app stacks): PostgreSQL backend on CNPG cluster at 10.0.20.200:5432/terraform_state with native pg_advisory_lock. Motivation: multi-operator friction (every workstation needed SOPS + age + git-crypt), bootstrap complexity for new operators, and headless agents/CI needing the full encryption toolchain just to read state. Changes: - terragrunt.hcl: conditional backend (local vs pg) based on tier0 list - scripts/tg: tier detection, auto-fetch PG creds from Vault for Tier 1, skip SOPS and Vault KV locking for Tier 1 stacks - scripts/state-sync: tier-aware encrypt/decrypt (skips Tier 1) - scripts/migrate-state-to-pg: one-shot migration script (idempotent) - stacks/vault/main.tf: pg-terraform-state static role + K8s auth role for claude-agent namespace - stacks/dbaas: terraform_state DB creation + MetalLB LoadBalancer service on shared IP 10.0.20.200 - Deleted 107 .tfstate.enc files for migrated Tier 1 stacks - Cleaned up per-stack tiers.tf (now generated by root terragrunt.hcl) [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 19:33:12 +00:00
Viktor Barzin	b1d152be1f	[infra] Auto-create Cloudflare DNS records from ingress_factory ## Context Deploying new services required manually adding hostnames to cloudflare_proxied_names/cloudflare_non_proxied_names in config.tfvars — a separate file from the service stack. This was frequently forgotten, leaving services unreachable externally. ## This change: - Add `dns_type` parameter to `ingress_factory` and `reverse_proxy/factory` modules. Setting `dns_type = "proxied"` or `"non-proxied"` auto-creates the Cloudflare DNS record (CNAME to tunnel or A/AAAA to public IP). - Simplify cloudflared tunnel from 100 per-hostname rules to wildcard `*.viktorbarzin.me → Traefik`. Traefik still handles host-based routing. - Add global Cloudflare provider via terragrunt.hcl (separate cloudflare_provider.tf with Vault-sourced API key). - Migrate 118 hostnames from centralized config.tfvars to per-service dns_type. 17 hostnames remain centrally managed (Helm ingresses, special cases). - Update docs, AGENTS.md, CLAUDE.md, dns.md runbook. ``` BEFORE AFTER config.tfvars (manual list) stacks/<svc>/main.tf \| module "ingress" { v dns_type = "proxied" stacks/cloudflared/ } for_each = list \| cloudflare_record auto-creates tunnel per-hostname cloudflare_record + annotation ``` ## What is NOT in this change: - Uptime Kuma monitor migration (still reads from config.tfvars) - 17 remaining centrally-managed hostnames (Helm, special cases) - Removal of allow_overwrite (keep until migration confirmed stable) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 13:45:04 +00:00
Viktor Barzin	1613003d00	upgrade: vaultwarden 1.35.4 -> 1.35.7 Security fixes (1.35.5): 3 CVEs — org vault purge by unconfirmed owner (GHSA-937x-3j8m-7w7p), cross-org group binding unauthorized access (GHSA-569v-845w-g82p), refresh tokens not invalidated on stamp rotation (GHSA-6j4w-g4jh-xjfx). 2FA remember tokens now max 30 days. 1.35.6: Fix 2FA remember tokens broken in 1.35.5. 1.35.7: Fix 2FA for Android. Risk: SAFE (patch bump, no breaking changes) DB backup: yes (job: pre-upgrade-vaultwarden-1776280439, SQLite, 7 MiB) Config changes applied: none Flagged for manual review: none Co-Authored-By: Service Upgrade Agent <noreply@viktorbarzin.me>	2026-04-15 19:14:21 +00:00
Viktor Barzin	82b0f6c4cb	truenas deprecation: migrate all non-immich storage to proxmox NFS - Migrate 7 backup CronJobs to Proxmox host NFS (192.168.1.127) (etcd, mysql, postgresql, nextcloud, redis, vaultwarden, plotting-book) - Migrate headscale backup, ebook2audiobook, osm_routing to Proxmox NFS - Migrate servarr (lidarr, readarr, soulseek) NFS refs to Proxmox - Remove 79 orphaned TrueNAS NFS module declarations from 49 stacks - Delete stacks/platform/modules/ (27 dead module copies, 65MB) - Update nfs-truenas StorageClass to point to Proxmox (192.168.1.127) - Remove iscsi DNS record from config.tfvars - Fix woodpecker persistence config and alertmanager PV Only Immich (8 PVCs, ~1.4TB) remains on TrueNAS.	2026-04-12 14:35:39 +01:00
Viktor Barzin	ce7b8c2b2e	add pvc-autoresizer for automatic PVC expansion before volumes fill up [ci skip] Deploy topolvm/pvc-autoresizer controller that monitors kubelet_volume_stats via Prometheus and auto-expands annotated PVCs. Annotated all 9 block-storage PVCs (proxmox-lvm) with per-PVC thresholds and max limits. Updated PVFillingUp alert to critical/10m (means auto-expansion failed) and added PVAutoExpanding info alert at 80%.	2026-04-03 23:30:00 +03:00
Viktor Barzin	dd59512153	migrate iSCSI block volumes from democratic-csi to Proxmox CSI [ci skip] Replace TrueNAS iSCSI (democratic-csi) with Proxmox CSI plugin for all block storage PVCs. Eliminates double-CoW (ZFS + LVM-thin) and removes the iSCSI network hop for database I/O. New stack: stacks/proxmox-csi/ — deploys proxmox-csi-plugin Helm chart with StorageClass "proxmox-lvm" using existing local-lvm thin pool. Migrated PVCs (12 total): - Phase 1 standalone: plotting-book, novelapp, vaultwarden, nextcloud, prometheus - Phase 2 StatefulSets: CNPG PostgreSQL (2), MySQL InnoDB (3), Redis (2) All services verified healthy post-migration.	2026-04-02 22:13:04 +03:00
Viktor Barzin	d20c5e5535	add backup_output_bytes metric and cloudsync_transferred_bytes to backup dashboard - All 7 backup CronJobs now push backup_output_bytes (file size after backup) - Cloud Sync monitor parses rclone transfer stats into cloudsync_transferred_bytes - Grafana dashboard: new Output (MiB) table column, Output Size Trend panel, Write Throughput panel, Cloud Sync Transfer Volume bargauge - All timeseries panels use points-only draw style (discrete backup snapshots) - etcd backup restructured: init_container for etcdctl (distroless image), busybox sidecar for metrics push + purge, ClusterFirstWithHostNet DNS - Fixed pre-existing curl missing in postgres:16.4-bullseye (immich, dbaas PG) - Fixed grep -oP not available in alpine/busybox (cloud sync monitor)	2026-03-25 10:44:53 +02:00
Viktor Barzin	a95d434ff1	fix backup IO stats: use /proc/$$/io instead of /proc/self/io /proc/self/io inside $(awk ...) resolves to the awk subprocess PID, not the parent bash shell. Use $$ (bash PID) to read the correct process IO counters.	2026-03-23 12:33:52 +02:00
Viktor Barzin	0a294a30a6	add backup IO logging, Pushgateway metrics, and Grafana dashboard - Add /proc/self/io read/write tracking to vault raft-backup and etcd backup - Push backup_duration_seconds, backup_read_bytes, backup_written_bytes, backup_last_success_timestamp to Pushgateway from all 6 backup CronJobs (etcd skipped — distroless image has no wget/curl) - Add cloudsync_duration_seconds metric to cloudsync-monitor - New "Backup Health" Grafana dashboard with 8 panels: time since last backup, overview table, duration/IO trends, cloud sync status, alerts, CronJob schedule	2026-03-23 12:19:01 +02:00
Viktor Barzin	311ff5dd9e	add hourly SQLite integrity check for vaultwarden with Prometheus alerting - New CronJob runs PRAGMA integrity_check every hour - Pushes vaultwarden_sqlite_integrity_ok metric to Prometheus pushgateway - VaultwardenSQLiteCorrupt alert fires immediately on corruption (critical) - VaultwardenIntegrityCheckStale alert if check hasn't run in 2h (warning) - Prevents running for days on a corrupted DB unnoticed	2026-03-23 00:50:15 +02:00
Viktor Barzin	a44f35bcf8	harden vaultwarden iSCSI storage and increase backup frequency - Increase backup from daily to every 6 hours (0 */6 * * *) - Add pre/post-flight SQLite integrity checks to backup job - Harden iSCSI on all nodes: increase recovery timeout (300s), enable CRC32C data/header digests for bit-flip detection - Fix restore runbook PVC name (vaultwarden-data-iscsi) Motivated by SQLite corruption from iSCSI I/O errors.	2026-03-23 00:36:11 +02:00
Viktor Barzin	73511b1230	extract remaining 19 modules from platform, complete stack split [ci skip] Phase 3: all 27 platform modules now run as independent stacks. Platform reduced to empty shell (outputs only) for backward compat with 72 app stacks that declare dependency "platform". Fixed technitium cross-module dashboard reference by copying file. Woodpecker pipeline applies all 27+1 stacks in parallel via loop. All applied with zero destroys.	2026-03-17 21:42:16 +00:00
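
Commit e80b2f026f describes tier detection in scripts/tg: Tier 0 stacks keep local SOPS-encrypted state, while every other stack uses the PostgreSQL backend. The script itself is not shown in this log, so the following is only a minimal sketch of that decision. The function name `backend_for_stack` and the variable `TIER0_STACKS` are hypothetical; only the six Tier 0 stack names come from the commit message.

```shell
#!/usr/bin/env bash
# Sketch only: not the actual scripts/tg implementation.

# Tier 0 stacks as listed in commit e80b2f026f.
TIER0_STACKS="infra platform cnpg vault dbaas external-secrets"

# Hypothetical helper: print "local" for Tier 0 stacks, "pg" for everything else.
backend_for_stack() {
  local stack="$1" name
  for name in $TIER0_STACKS; do
    if [ "$name" = "$stack" ]; then
      echo "local"
      return 0
    fi
  done
  echo "pg"
}

backend_for_stack vault        # Tier 0: local state with SOPS
backend_for_stack vaultwarden  # Tier 1: PostgreSQL backend
```

The comparison is an exact string match, so a Tier 1 stack whose name merely starts with a Tier 0 name (such as vaultwarden and vault) is still classified as Tier 1.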

13 commits
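
Commit a95d434ff1 fixes a subtle PID issue: inside `$(...)`, `/proc/self` resolves to the child process that opens it, not to the parent bash shell, whereas `$$` still expands to the parent shell's PID. The backup scripts themselves are not shown in this log, so the following is only an illustrative sketch, assuming a Linux host with `/proc` mounted. The helper name `shell_io_counter` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: shows why /proc/self differs from /proc/$$ inside $(...).

shell_pid=$$

# readlink runs in a child process, so /proc/self resolves to that child's PID.
pid_seen_by_child=$(readlink /proc/self)

echo "shell PID:             $shell_pid"
echo "PID behind /proc/self: $pid_seen_by_child"

# Hypothetical helper: print one IO counter (e.g. write_bytes) of the shell itself.
# $$ keeps the parent shell's PID even inside the command substitution below.
shell_io_counter() {
  awk -v key="$1:" '$1 == key { print $2 }' "/proc/$$/io" 2>/dev/null
}

written=$(shell_io_counter write_bytes)
echo "write_bytes so far: ${written:-unavailable}"
```

The two PIDs printed at the top differ, which is why reading `/proc/self/io` from within the `awk` call reported the helper's counters rather than the backup job's.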