Commit graph

12 commits

Viktor Barzin
752f94ab8f [monitoring] Opt-out external monitor for family/mladost3/task-webhook/torrserver; drop r730
The `external-monitor-sync` script monitors every *.viktorbarzin.me
ingress by default, so a missing annotation means "monitored."
Both ingress factories previously OMITTED the annotation when
`external_monitor = false`, which silently left monitors in place.

Fix: when the caller sets `external_monitor = false` explicitly, emit
`uptime.viktorbarzin.me/external-monitor = "false"` so the sync script
deletes the monitor. Keep the previous behavior (no annotation) for
callers that leave external_monitor null — otherwise 19 publicly-reachable
services with `dns_type="none"` would lose monitoring.
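
A minimal sketch of the intended annotation logic (the `external_monitor`
variable already exists in both factories; the local name here is
illustrative):

```hcl
variable "external_monitor" {
  type     = bool
  default  = null # null = leave the annotation alone, keep prior behavior
  nullable = true
}

locals {
  # Only an explicit `false` emits the opt-out annotation so that
  # external-monitor-sync deletes the monitor; `null` adds nothing.
  monitor_opt_out_annotation = var.external_monitor == false ? {
    "uptime.viktorbarzin.me/external-monitor" = "false"
  } : {}
}
```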

Set external_monitor=false on family (grampsweb) and mladost3 (reverse-proxy)
to match the other two already-flagged services. Delete the r730 ingress
module entirely — the Dell server has been decommissioned.
2026-04-19 15:18:27 +00:00
Viktor Barzin
a86a97deb7 [reverse-proxy] Fix gw.viktorbarzin.me — point at 192.168.1.1 via EndpointSlice
The TP-Link gateway was wired up via an ExternalName Service pointing at
`gw.viktorbarzin.lan`, but Technitium has no record for that name (the
router isn't a DHCP client and Kea DDNS never registers it), so DNS
resolution of the ingress backend returned NXDOMAIN and the
`[External] gw` Uptime Kuma monitor was permanently failing.

Factory now accepts `backend_ip` as an alternative to `external_name`: it
creates a selector-less ClusterIP Service + manual EndpointSlice pointing
at the given IP, bypassing cluster DNS entirely. Used for gw (192.168.1.1);
the old ExternalName path is retained for every other service.
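
Roughly what the `backend_ip` path creates, as a hedged sketch: the
resource names, namespace, and port below are assumptions, only the IP
comes from this change.

```hcl
# Selector-less ClusterIP Service: nothing auto-populates its endpoints.
resource "kubernetes_service_v1" "gw" {
  metadata {
    name      = "gw"
    namespace = "reverse-proxy"
  }
  spec {
    port {
      port        = 80
      target_port = 80
      protocol    = "TCP"
    }
    # no selector block on purpose
  }
}

# Manual EndpointSlice pointing the Service at the router's LAN IP,
# bypassing cluster DNS entirely.
resource "kubernetes_manifest" "gw_endpoints" {
  manifest = {
    apiVersion = "discovery.k8s.io/v1"
    kind       = "EndpointSlice"
    metadata = {
      name      = "gw-1"
      namespace = "reverse-proxy"
      labels    = { "kubernetes.io/service-name" = "gw" }
    }
    addressType = "IPv4"
    endpoints   = [{ addresses = ["192.168.1.1"] }]
    ports       = [{ name = "http", port = 80, protocol = "TCP" }]
  }
}
```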

Also add a direct `port` monitor for the router in uptime-kuma's
internal_monitors list so we can tell a Cloudflare/tunnel outage apart
from the router itself being down. Extended the internal-monitor-sync
script to handle non-DB monitor types (hostname + port fields).
2026-04-19 15:07:24 +00:00
Viktor Barzin
2431c6d5fe [reverse-proxy] ha-sofia per-service retry + ServersTransport
Adds a ha-sofia-retry Middleware (attempts=3, initialInterval=100ms)
and a ha-sofia-transport ServersTransport (dialTimeout=500ms), wired into
the ha-sofia + music-assistant ingresses. This absorbs the 67-156ms
connect/DNS stalls that were surfacing as 18 × 502s/day, without disturbing
the global 2-attempt retry or Immich's 60s dialTimeout. The ingresses
`depends_on` the new manifests to avoid the dangling-reference pattern
from the 2026-04-17 Traefik P0.
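
For reference, the two objects roughly amount to the following
kubernetes_manifest resources (the namespace is an assumption; the numbers
are the ones stated above):

```hcl
resource "kubernetes_manifest" "ha_sofia_retry" {
  manifest = {
    apiVersion = "traefik.io/v1alpha1"
    kind       = "Middleware"
    metadata   = { name = "ha-sofia-retry", namespace = "reverse-proxy" }
    spec = {
      retry = {
        attempts        = 3
        initialInterval = "100ms"
      }
    }
  }
}

resource "kubernetes_manifest" "ha_sofia_transport" {
  manifest = {
    apiVersion = "traefik.io/v1alpha1"
    kind       = "ServersTransport"
    metadata   = { name = "ha-sofia-transport", namespace = "reverse-proxy" }
    spec = {
      forwardingTimeouts = { dialTimeout = "500ms" }
    }
  }
}
```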

Closes: code-rd1
2026-04-19 14:07:07 +00:00
Viktor Barzin
8b43692af0 [infra] Suppress Goldilocks vpa-update-mode label drift on all namespaces [ci skip]
## Context

Wave 3B-continued: the Goldilocks VPA dashboard (stacks/vpa) runs a Kyverno
ClusterPolicy `goldilocks-vpa-auto-mode` that mutates every namespace with
`metadata.labels["goldilocks.fairwinds.com/vpa-update-mode"] = "off"`. This
is intentional — Terraform owns container resource limits, and Goldilocks
should only provide recommendations, never auto-update. The label is how
Goldilocks decides per-namespace whether to run its VPA in `off` mode.
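
The policy is roughly of this shape (a sketch, not the stack's exact
manifest):

```hcl
resource "kubernetes_manifest" "goldilocks_vpa_auto_mode" {
  manifest = {
    apiVersion = "kyverno.io/v1"
    kind       = "ClusterPolicy"
    metadata   = { name = "goldilocks-vpa-auto-mode" }
    spec = {
      rules = [{
        name  = "set-vpa-update-mode-off"
        match = { any = [{ resources = { kinds = ["Namespace"] } }] }
        mutate = {
          # Stamps the label Goldilocks reads, on every namespace.
          patchStrategicMerge = {
            metadata = {
              labels = { "goldilocks.fairwinds.com/vpa-update-mode" = "off" }
            }
          }
        }
      }]
    }
  }
}
```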

Effect on Terraform: every `kubernetes_namespace` resource shows the label
as pending-removal (`-> null`) on every `scripts/tg plan`. Dawarich survey
2026-04-18 confirmed the drift. Cluster-side count: 88 namespaces carry the
label (`kubectl get ns -o json | jq ... | wc -l`). Every TF-managed namespace
is affected.

This commit brings the intentional admission drift under the same
`# KYVERNO_LIFECYCLE_V1` discoverability marker introduced in c9d221d5 for
the ndots dns_config pattern. The marker now stands generically for any
Kyverno admission-webhook drift suppression; the inline comment records
which specific policy stamps which specific field so future grep audits
show why each suppression exists.

## This change

107 `.tf` files touched — every stack's `kubernetes_namespace` resource
gets:

```hcl
lifecycle {
  # KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
  ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
```

Injection was done with a brace-depth-tracking Python pass (`/tmp/add_goldilocks_ignore.py`):
match `^resource "kubernetes_namespace" ` → track `{` / `}` until the
outermost closing brace → insert the lifecycle block before the closing
brace. The script is idempotent (skips any file that already mentions
`goldilocks.fairwinds.com/vpa-update-mode`) so re-running is safe.

The vault stack picked up 2 namespaces in the same file (k8s-users produces
one, plus a second explicit namespace) — confirmed via the file diff (+8 lines).

## What is NOT in this change

- `stacks/trading-bot/main.tf` — entire file is `/* … */` commented out
  (paused 2026-04-06 per user decision). Reverted after the script ran.
- `stacks/_template/main.tf.example` — per-stack skeleton, intentionally
  minimal. User keeps it that way. Not touched by the script (file
  has no real `resource "kubernetes_namespace"` — only a placeholder
  comment).
- `.terraform/` copies (e.g. `stacks/metallb/.terraform/modules/...`) —
  gitignored, won't commit; the live path was edited.
- `terraform fmt` cleanup of adjacent pre-existing alignment issues in
  authentik, freedify, hermes-agent, nvidia, vault, meshcentral. Reverted
  to keep the commit scoped to the Goldilocks sweep. Those files will
  need a separate fmt-only commit or will be cleaned up on next real
  apply to that stack.

## Verification

Dawarich (one of the hundred-plus touched stacks) showed the pattern
before and after:

```
$ cd stacks/dawarich && ../../scripts/tg plan

Before:
  Plan: 0 to add, 2 to change, 0 to destroy.
   # kubernetes_namespace.dawarich will be updated in-place
     (goldilocks.fairwinds.com/vpa-update-mode -> null)
   # module.tls_secret.kubernetes_secret.tls_secret will be updated in-place
     (Kyverno generate.* labels — fixed in 8d94688d)

After:
  No changes. Your infrastructure matches the configuration.
```

Injection count check:
```
$ rg -c 'KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode' stacks/ | awk -F: '{s+=$2} END {print s}'
108
```

## Reproduce locally
1. `git pull`
2. Pick any stack: `cd stacks/<name> && ../../scripts/tg plan`
3. Expect: no drift on the namespace's goldilocks.fairwinds.com/vpa-update-mode label.

Closes: code-dwx

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 21:15:27 +00:00
Viktor Barzin
b034c868db [traefik] Remove broken rewrite-body plugin and all rybbit/anti-AI injection
The rewrite-body Traefik plugin (both packruler/rewrite-body v1.2.0 and
the-ccsn/traefik-plugin-rewritebody v0.1.3) silently fails on Traefik
v3.6.12 due to Yaegi interpreter issues with ResponseWriter wrapping.
Both plugins load without errors but never inject content.

Removed:
- rewrite-body plugin download (init container) and registration
- strip-accept-encoding middleware (only existed for rewrite-body bug)
- anti-ai-trap-links middleware (used rewrite-body for injection)
- rybbit_site_id variable from ingress_factory and reverse_proxy factory
- rybbit_site_id from 25 service stacks (39 instances)
- Per-service rybbit-analytics middleware CRD resources

Kept:
- compress middleware (entrypoint-level, working correctly)
- ai-bot-block middleware (ForwardAuth to bot-block-proxy)
- anti-ai-headers middleware (X-Robots-Tag: noai, noimageai)
- All CrowdSec, Authentik, rate-limit middleware unchanged

Next: Cloudflare Workers with HTMLRewriter for edge-side injection.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 12:41:17 +00:00
Viktor Barzin
66d2d9916b [infra] Per-ingress external-monitor annotation + actualbudget plan-time fix [ci skip]
## Context
Two operational gaps surfaced during a healthcheck sweep today:

1. **External monitoring coverage**: Only ~13 hostnames (via `cloudflare_proxied_names`
   in `config.tfvars`) had `[External]` monitors in Uptime Kuma. Any service deployed via
   `ingress_factory` with `dns_type = "proxied"` auto-created its DNS record but was NOT
   registered for external probing — so outages like Immich going down externally were
   invisible until a user complained. 99 of ~125 public ingresses had no external
   monitor.

2. **actualbudget stack unplannable**: `count = var.budget_encryption_password != null
   ? 1 : 0` in `factory/main.tf:152` failed with "Invalid count argument" because the
   value flows from a `data.kubernetes_secret` whose contents are `(known after apply)`
   at plan time. Blocked CI applies and drift reconciliation.

## This change

### Per-ingress external-monitor annotation (ingress_factory + reverse_proxy/factory)
- New variables `external_monitor` (bool, nullable) + `external_monitor_name` (string,
  nullable). Default is "follow dns_type" — enabled for any public DNS record
  (`dns_type != "none"`, covers both proxied and non-proxied so Immich and other
  direct-A records are also monitored).
- Emits two annotations on the Ingress:
  - `uptime.viktorbarzin.me/external-monitor = "true"`
  - `uptime.viktorbarzin.me/external-monitor-name = "<label>"` (optional override)
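
In factory terms the default works out to roughly this (a sketch; the real
variable and local names may differ):

```hcl
variable "external_monitor" {
  type     = bool
  default  = null # null = "follow dns_type"
  nullable = true
}

variable "external_monitor_name" {
  type     = string
  default  = null
  nullable = true
}

locals {
  # Enabled when set explicitly, otherwise whenever a public DNS record exists.
  monitor_enabled = coalesce(var.external_monitor, var.dns_type != "none")

  monitor_annotations = merge(
    local.monitor_enabled ? { "uptime.viktorbarzin.me/external-monitor" = "true" } : {},
    var.external_monitor_name != null
      ? { "uptime.viktorbarzin.me/external-monitor-name" = var.external_monitor_name }
      : {}
  )
}
```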

### external-monitor-sync CronJob (uptime-kuma stack)
- Discovers targets from live Ingress objects via the K8s API first (filter by
  annotation), falls back to the legacy `external-monitor-targets` ConfigMap on any
  API error (zero rollout risk).
- New `ServiceAccount` + cluster-wide `ClusterRole`/`ClusterRoleBinding` giving
  `list`/`get` on `networking.k8s.io/ingresses`.
- `API_SERVER` now uses the `KUBERNETES_SERVICE_HOST` env var (always injected by K8s)
  instead of `kubernetes.default.svc` — the search-domain expansion failed in the
  CronJob pod's DNS config. Verified working: CronJob now logs
  `Loaded N external monitor targets (source=k8s-api)`.
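
The RBAC piece is small; roughly (resource and ServiceAccount names below
are assumptions):

```hcl
resource "kubernetes_cluster_role_v1" "external_monitor_sync" {
  metadata { name = "external-monitor-sync" }
  rule {
    api_groups = ["networking.k8s.io"]
    resources  = ["ingresses"]
    verbs      = ["get", "list"]
  }
}

resource "kubernetes_cluster_role_binding_v1" "external_monitor_sync" {
  metadata { name = "external-monitor-sync" }
  role_ref {
    api_group = "rbac.authorization.k8s.io"
    kind      = "ClusterRole"
    name      = kubernetes_cluster_role_v1.external_monitor_sync.metadata[0].name
  }
  subject {
    kind      = "ServiceAccount"
    name      = "external-monitor-sync"
    namespace = "uptime-kuma"
  }
}
```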

### actualbudget count-on-unknown refactor
- Replaced `count = var.budget_encryption_password != null ? 1 : 0` with two explicit
  plan-time booleans: `enable_http_api` and `enable_bank_sync`. Values are known at
  plan; no `-target` workaround needed.
- Callers (`stacks/actualbudget/main.tf`) pass `true` explicitly. Runtime behaviour is
  unchanged — the secret is still consumed via env var.
- Also aligned the factory with live state (the 3 budget-* PVCs had been migrated
  `proxmox-lvm` → `proxmox-lvm-encrypted` outside Terraform): PVC resource renamed
  `data_proxmox` → `data_encrypted`, storage class updated, orphaned `nfs_data` module
  removed. State was rm'd + re-imported with matching UIDs, so no data was moved.
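
In outline, with placeholder resource names (the real factory resources
simply switch their `count` to the new booleans):

```hcl
# Before: the count expression referenced a data.kubernetes_secret value
# that is "(known after apply)" at plan time, so planning failed:
#   count = var.budget_encryption_password != null ? 1 : 0

# After: explicit plan-time booleans set by the caller.
variable "enable_http_api" {
  type    = bool
  default = false
}

variable "enable_bank_sync" {
  type    = bool
  default = false
}

# Placeholder resource; the count value is now always known at plan time.
resource "null_resource" "http_api" {
  count = var.enable_http_api ? 1 : 0
}
```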

## Rollout status (already partially applied in this session)
- `stacks/uptime-kuma` applied — SA + RBAC + CronJob changes live; FQDN fix verified
- `stacks/actualbudget` applied — budget-{viktor,anca,emo} all 200 OK externally
- `stacks/mailserver` + 21 other ingress_factory consumers applied — annotations live
- CronJob `external-monitor-sync` latest run: `source=k8s-api`, 26 monitors active
  (was 13 on the central list)

## Deferred (separate work)
- 4 stacks show pre-existing DESTRUCTIVE drift in plan (metallb namespace, claude-memory,
  rbac, redis) — NOT triggered by this commit but will be by CI's global-file cascade.
  `[ci skip]` here so those don't auto-apply; they will be fixed manually before the
  next CI push.
- Cleanup of `cloudflare_proxied_names` list once Helm-managed ingresses (authentik,
  grafana, vault, forgejo) are annotated — separate PR.

## Test plan

### Automated
```
$ kubectl -n uptime-kuma logs $(kubectl -n uptime-kuma get pods -l job-name -o name | tail -1)
Loaded 26 external monitor targets (source=k8s-api)
Sync complete: 7 created, 0 deleted, 17 unchanged

$ curl -sk -o /dev/null -w "%{http_code}\n" -H "Accept: text/html" \
    https://dawarich.viktorbarzin.me/ https://nextcloud.viktorbarzin.me/ \
    https://budget-viktor.viktorbarzin.me/
200 302 200

$ kubectl -n actualbudget get deploy,pvc -l app=budget-viktor
deployment.apps/budget-viktor     1/1 1 1 Ready
persistentvolumeclaim/budget-viktor-data-encrypted  Bound  10Gi  RWO  proxmox-lvm-encrypted
```

### Manual Verification
1. Confirm the annotation is present on an ingress_factory ingress:
   ```
   kubectl -n dawarich get ingress dawarich -o \
     jsonpath='{.metadata.annotations.uptime\.viktorbarzin\.me/external-monitor}'
   # Expected: "true"
   ```
2. Confirm the new `[External] <name>` monitor appears in Uptime Kuma within 10 min
   (CronJob interval). For Immich specifically, it will appear after the immich stack
   is re-applied.
3. Verify actualbudget plan is clean:
   ```
   cd stacks/actualbudget && scripts/tg plan --non-interactive
   # Expected: no "Invalid count argument" errors
   ```

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 10:34:32 +00:00
Viktor Barzin
f8facf44dd [infra] Fix rewrite-body plugin + cleanup TrueNAS + version bumps
## Context

The rewrite-body Traefik plugin (packruler/rewrite-body v1.2.0) silently
broke on Traefik v3.6.12 — every service using rybbit analytics or anti-AI
injection returned HTTP 200 with "Error 404: Not Found" body. Root cause:
middleware specs referenced plugin name `rewrite-body` but Traefik registered
it as `traefik-plugin-rewritebody`.

Migrated to maintained fork `the-ccsn/traefik-plugin-rewritebody` v0.1.3
which uses the correct plugin name. Also added `lastModified = true` and
`methods = ["GET"]` to anti-AI middleware to avoid rewriting non-HTML
responses.
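
The mismatch lives in the middleware spec's `plugin` key, which has to
match the name Traefik registers the plugin under. A rough illustration;
the plugin's config keys below are assumptions rather than its documented
schema:

```hcl
resource "kubernetes_manifest" "anti_ai_trap_links" {
  manifest = {
    apiVersion = "traefik.io/v1alpha1"
    kind       = "Middleware"
    metadata   = { name = "anti-ai-trap-links", namespace = "reverse-proxy" }
    spec = {
      plugin = {
        # Was "rewrite-body"; Traefik registered the plugin as
        # "traefik-plugin-rewritebody", so the key must use that name.
        "traefik-plugin-rewritebody" = {
          lastModified = true
          # plugin-specific rewrite config goes here
        }
      }
    }
  }
}
```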

## This change

- Replace packruler/rewrite-body v1.2.0 with the-ccsn/traefik-plugin-rewritebody v0.1.3
- Fix plugin name in all 3 middleware locations (ingress_factory, reverse-proxy factory, traefik anti-AI)
- Remove deprecated TrueNAS cloud sync monitor (VM decommissioned 2026-04-13)
- Remove CloudSyncStale/CloudSyncFailing/CloudSyncNeverRun alerts
- Fix PrometheusBackupNeverRun alert (for: 48h → 32d to match monthly sidecar schedule)
- Bump versions: rybbit v1.0.21→v1.1.0, wealthfolio v1.1.0→v3.2,
  networking-toolbox 1.1.1→1.6.0, cyberchef v10.24.0→v9.55.0
- MySQL standalone storage_limit 30Gi → 50Gi
- beads-server: fix Dolt workbench type casing, remove Authentik on GraphQL endpoint

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 05:51:52 +00:00
Viktor Barzin
e80b2f026f [infra] Migrate Terraform state from local SOPS to PostgreSQL backend
Two-tier state architecture:
- Tier 0 (infra, platform, cnpg, vault, dbaas, external-secrets): local
  state with SOPS encryption in git — unchanged, required for bootstrap.
- Tier 1 (105 app stacks): PostgreSQL backend on CNPG cluster at
  10.0.20.200:5432/terraform_state with native pg_advisory_lock.

Motivation: multi-operator friction (every workstation needed SOPS + age +
git-crypt), bootstrap complexity for new operators, and headless agents/CI
needing the full encryption toolchain just to read state.

Changes:
- terragrunt.hcl: conditional backend (local vs pg) based on tier0 list
- scripts/tg: tier detection, auto-fetch PG creds from Vault for Tier 1,
  skip SOPS and Vault KV locking for Tier 1 stacks
- scripts/state-sync: tier-aware encrypt/decrypt (skips Tier 1)
- scripts/migrate-state-to-pg: one-shot migration script (idempotent)
- stacks/vault/main.tf: pg-terraform-state static role + K8s auth role
  for claude-agent namespace
- stacks/dbaas: terraform_state DB creation + MetalLB LoadBalancer
  service on shared IP 10.0.20.200
- Deleted 107 .tfstate.enc files for migrated Tier 1 stacks
- Cleaned up per-stack tiers.tf (now generated by root terragrunt.hcl)

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 19:33:12 +00:00
Viktor Barzin
b1d152be1f [infra] Auto-create Cloudflare DNS records from ingress_factory
## Context

Deploying new services required manually adding hostnames to
cloudflare_proxied_names/cloudflare_non_proxied_names in config.tfvars —
a separate file from the service stack. This was frequently forgotten,
leaving services unreachable externally.

## This change:

- Add `dns_type` parameter to `ingress_factory` and `reverse_proxy/factory`
  modules. Setting `dns_type = "proxied"` or `"non-proxied"` auto-creates
  the Cloudflare DNS record (CNAME to tunnel or A/AAAA to public IP).
- Simplify cloudflared tunnel from 100 per-hostname rules to wildcard
  `*.viktorbarzin.me → Traefik`. Traefik still handles host-based routing.
- Add global Cloudflare provider via terragrunt.hcl (separate
  cloudflare_provider.tf with Vault-sourced API key).
- Migrate 118 hostnames from centralized config.tfvars to per-service
  dns_type. 17 hostnames remain centrally managed (Helm ingresses,
  special cases).
- Update docs, AGENTS.md, CLAUDE.md, dns.md runbook.

```
BEFORE                          AFTER
config.tfvars (manual list)     stacks/<svc>/main.tf
        |                         module "ingress" {
        v                           dns_type = "proxied"
stacks/cloudflared/               }
  for_each = list                     |
  cloudflare_record               auto-creates
  tunnel per-hostname             cloudflare_record + annotation
```
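
A hedged caller-side and factory-side sketch; the module path and variable
names are illustrative, and the Cloudflare record attribute is `value` on
older provider versions:

```hcl
# Caller: stacks/<svc>/main.tf
module "ingress" {
  source   = "../../modules/ingress_factory"
  name     = "myservice"
  dns_type = "proxied" # or "non-proxied", or "none" to skip DNS entirely
}

# Factory: one record per ingress, CNAME to the wildcard tunnel when proxied.
resource "cloudflare_record" "this" {
  count   = var.dns_type == "none" ? 0 : 1
  zone_id = var.cloudflare_zone_id
  name    = var.name
  type    = var.dns_type == "proxied" ? "CNAME" : "A"
  content = var.dns_type == "proxied" ? var.tunnel_hostname : var.public_ip
  proxied = var.dns_type == "proxied"
}
```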

## What is NOT in this change:

- Uptime Kuma monitor migration (still reads from config.tfvars)
- 17 remaining centrally-managed hostnames (Helm, special cases)
- Removal of allow_overwrite (keep until migration confirmed stable)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 13:45:04 +00:00
Viktor Barzin
56583c3825 fix: add retry middleware and per-service rate limit for ha-sofia
The global rate limit (10 req/s, 50 burst) was too aggressive for HA
dashboards that load 30+ JS files on page load, causing 429s. VPN tunnel
blips between London K8s and Sofia caused 502s with no retry fallback.

- Add traefik-retry middleware to reverse-proxy factory (all services)
- Add skip_global_rate_limit variable to both reverse-proxy factories
- Create ha-sofia-rate-limit middleware (100 avg, 200 burst)
- Apply to ha-sofia and music-assistant (both route to Sofia)
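
The per-service limit is a plain Traefik rateLimit middleware; a sketch
with an assumed namespace and apiVersion (the CRD group depends on the
Traefik version in use):

```hcl
resource "kubernetes_manifest" "ha_sofia_rate_limit" {
  manifest = {
    apiVersion = "traefik.io/v1alpha1"
    kind       = "Middleware"
    metadata   = { name = "ha-sofia-rate-limit", namespace = "reverse-proxy" }
    spec = {
      rateLimit = {
        average = 100 # avg req/s, up from the global 10
        burst   = 200 # up from the global 50
      }
    }
  }
}
```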
2026-04-05 20:47:58 +03:00
Viktor Barzin
36171bcda4 add htpasswd auth to private docker registry + expose at registry.viktorbarzin.me
- Add auth.htpasswd section to config-private.yml
- Mount htpasswd file in registry-private container, fix healthcheck for 401
- Rename registry UI from registry.viktorbarzin.me → docker.viktorbarzin.me
- Add Docker CLI ingress at registry.viktorbarzin.me (HTTPS backend, no rate-limit, unlimited body)
- Add docker to cloudflare_proxied_names (registry stays non-proxied)
- Add Kyverno ClusterPolicy to sync registry-credentials secret to all namespaces
- Update infra provisioning to install apache2-utils and generate htpasswd from Vault
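
The registry-side auth section boils down to a few lines of registry
config. A sketch, assuming the config ships via a ConfigMap and the
htpasswd file is mounted at /auth/htpasswd:

```hcl
resource "kubernetes_config_map_v1" "registry_private_config" {
  metadata {
    name      = "registry-private-config"
    namespace = "docker-registry"
  }
  data = {
    "config-private.yml" = <<-EOT
      auth:
        htpasswd:
          realm: Registry Realm
          path: /auth/htpasswd
    EOT
  }
}
```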
2026-03-22 22:10:10 +02:00
Viktor Barzin
73511b1230 extract remaining 19 modules from platform, complete stack split [ci skip]
Phase 3: all 27 platform modules now run as independent stacks.
The platform stack is reduced to an empty shell (outputs only) for
backward compat with the 72 app stacks that declare a "platform"
dependency. Fixed the technitium cross-module dashboard reference by
copying the file. The Woodpecker pipeline applies all 27+1 stacks in
parallel via a loop. All applied with zero destroys.
2026-03-17 21:42:16 +00:00