infra

Author	SHA1	Message	Date
Viktor Barzin	bc866d53fa	[servarr/mam-farming] Tune grabber for MAM's real catalogue ## Context After the Mouse-class unblock on 2026-04-19, end-to-end testing of the grabber revealed three issues with the plan's original filter values: 1. `SEEDER_CEILING=50` rejects ~99% of MAM's catalogue. MAM is a well-seeded private tracker — 100-700 seeders per torrent is normal. A ceiling of 50 makes the filter too tight: across 140 FL torrents sampled in one loop, only 0-1 matched. The intent ("avoid oversupplied swarms") is still valid; the threshold was wrong for MAM's shape. 2. `RATIO_FLOOR=1.2` was sized for Mouse-class defence and is now over-tight. Its job is preventing the death spiral where Mouse-class accounts can't announce, so any grab deepens the ratio hole. Once class > Mouse, MAM serves peer lists normally and demand-first filtering (`leechers>=1`) keeps new grabs upload-positive on average. With ratio sitting at 0.7 post-recovery (we over-downloaded while unblocking), 1.2 was preventing the very grabs that would earn us back to healthy ratio. 3. `parse_size` crashed on `"1,002.9 MiB"`. MAM's pretty-printed sizes use thousands separators; `float("1,002.9")` raises `ValueError`. Every grabber run that hit a ≥1000-MiB candidate on the page crashed with a traceback instead of skipping the size. ## This change - `SEEDER_CEILING`: 50 → 200 — live catalogue evidence showed 50 was rejecting viable demand-first candidates like `Zen and the Art of Motorcycle Maintenance` (S=156, L=1, score=125). - `RATIO_FLOOR`: 1.2 → 0.5 — still a tripwire for catastrophic dips, but no longer a steady-state block. Class == Mouse remains an absolute skip (separate branch). - `parse_size`: `s.replace(",", "").split()` before int-parse. 
## Verified post-change Manual grabber loop (5 runs at random offsets) after applying: run=1 parse_size crash on "1,002.9" (this crash motivated fix #3) run=2 GRABBED 3 torrents: Dean and Me: A Love Story (240.7 MiB, S:18, L:1) score=194 Digital Nature Photography (83.7 MiB, S:42, L:1) score=182 Zen and the Art of Motorcycle (830.3 MiB, S:156, L:1) score=125 run=3-5 grabbed=0 at offsets that landed on pages with no matches (expected — MAM returns 20/page, many offsets yield nothing) MAM profile: class=User, ratio=0.7 (recovering from the Mouse unblock), BP=24,053. 28 mam-farming torrents in forcedUP state, actively uploading ~8 MiB to MAM this session across 2 of the Maxximized comic issues. ## What is NOT in this change - No alert threshold changes — `MAMRatioBelowOne` (24h) and `MAMMouseClass` (1h) already handle the "going back to Mouse" case; lowering the floor on the grabber doesn't change alerting. - No janitor changes — the janitor rules are H&R-based and independent of ratio/class state. ## Test plan ### Automated $ cd infra/stacks/servarr && ../../scripts/tg apply --non-interactive Apply complete! Resources: 0 added, 2 changed, 0 destroyed. $ python3 -c 'import ast; ast.parse(open( "infra/stacks/servarr/mam-farming/files/freeleech-grabber.py").read())' ### Manual Verification 1. Trigger the grabber and confirm it doesn't skip-for-ratio at ratio 0.7: $ kubectl -n servarr create job --from=cronjob/mam-freeleech-grabber g1 $ kubectl -n servarr logs job/g1 | head -5 Profile: ratio=0.7 class=User | Farming: 33, 2.0 GiB, tracked IDs: 4 Search offset=<random>, found=1323, page_results=20 Added (score=...) ... 2. Repeat 3-5× at different random offsets. Over the course of a 30-min cron cadence, expect 2-5 grabs across the day given MAM's catalogue churn and our filter intersection. 
## Reproduce locally cd infra/stacks/servarr ../../scripts/tg plan # expect: 0 to add, 2 to change (configmap + cronjob) ../../scripts/tg apply --non-interactive kubectl -n servarr create job --from=cronjob/mam-freeleech-grabber g1 kubectl -n servarr logs job/g1 Follow-up: `bd close code-qfs` already completed in the parent commit; this is a post-shipping tune, no beads action needed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 15:46:46 +00:00
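The three tuned pieces in the commit above can be sketched roughly as follows. This is an illustration, not the contents of `freeleech-grabber.py`: the constants come from the commit message, while the unit table, the `None` return for unparseable input, the inclusive comparisons, and the helper names `ratio_guard_ok` and `passes_filter` are assumptions.

```python
# Hypothetical sketch of the tuned grabber helpers (names are illustrative).
SEEDER_CEILING = 200   # was 50
RATIO_FLOOR = 0.5      # was 1.2
UNITS = {"B": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3, "TiB": 1024**4}


def parse_size(s):
    """Convert a pretty-printed size like "1,002.9 MiB" to bytes.

    Thousands separators are stripped before the number is parsed, so
    float() no longer raises ValueError. Returns None when the string
    cannot be understood (assumed behaviour: skip, do not crash).
    """
    parts = s.replace(",", "").split()
    if len(parts) != 2 or parts[1] not in UNITS:
        return None
    try:
        value = float(parts[0])
    except ValueError:
        return None
    return int(value * UNITS[parts[1]])


def ratio_guard_ok(ratio, user_class):
    """Mouse class is an absolute skip; otherwise only the floor applies."""
    if user_class == "Mouse":
        return False
    return ratio >= RATIO_FLOOR


def passes_filter(seeders, leechers):
    """Demand-first filter: at least one leecher, not an oversupplied swarm."""
    return leechers >= 1 and seeders <= SEEDER_CEILING
```

With these values the `Zen and the Art of Motorcycle Maintenance` candidate (S=156, L=1) passes the filter, and a `User`-class account at ratio 0.7 is no longer skipped.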
Service Upgrade Agent	55ade1f9b3	[servarr] Fix qbittorrent container_port 8787 -> 8080 (matches WEBUI_PORT) Co-Authored-By: Service Upgrade Agent <noreply@viktorbarzin.me>	2026-04-19 13:37:44 +00:00
Service Upgrade Agent	094bc727d4	upgrade: qbittorrent 5.0.4 -> 5.1.4 Changelog summary: Minor version bump; patch releases update external Alpine packages and restore qbittorrent-cli openssl3 support. Risk: SAFE Breaking changes: none DB backup: no (not DB-backed) Config changes applied: none Flagged for manual review: none Co-Authored-By: Service Upgrade Agent <noreply@viktorbarzin.me>	2026-04-19 13:26:15 +00:00
Viktor Barzin	789cb61310	[servarr] Rewrite MAM ratio farming — break Mouse death spiral, adopt in TF ## Context A MAM (MyAnonamouse) freeleech farming workflow was deployed on 2026-04-14 via kubectl apply (outside Terraform). Five days later the account was still stuck in Mouse class: 715 MiB downloaded, 0 uploaded, ratio 0. Tracker responses on 7 of 9 active torrents returned `status=4 | msg="User currently mouse rank, you need to get your ratio up!"` — MAM was actively refusing to serve peer lists because the account was in Mouse class, and refusing to serve peer lists made the ratio impossible to recover. Meanwhile the grabber kept digging: 501 torrents sat in qBittorrent, 0 completed, 0 bytes uploaded. Root causes (ranked): 1. Death spiral — Mouse class blocks announces, nothing uploads. 2. BP-spender 30 000 BP threshold blocked the only exit even though the account already had 24 500 BP. 3. Grabber selection (`score = 1.0 / (seeders+1)`) preferred low-demand torrents filtered to <100 MiB — ratio-hostile by design. 4. Grabber/cleanup deadlock: cleanup only fired on seed_time > 3d, so torrents that never started never qualified. Combined with the 500-torrent cap this stalled the grabber indefinitely. 5. qBittorrent queueing amplified (4) — 495/501 stuck in queuedDL. 6. Ratio-monitor labelled queued torrents `unknown` (empty tracker field), hiding the problem on the MAM Grafana panel. 7. qBittorrent memory limit (256 Mi LimitRange default) too low. 8. All of the above was Terraform drift with no reviewability. ## This change Introduces `stacks/servarr/mam-farming/` — a new TF module that adopts the three kubectl-applied resources and replaces their scripts with demand-first, H&R-aware logic. Also bumps qBittorrent resources, fixes ratio-monitor labelling, and adds five Prometheus alerts plus a Grafana panel row. 
### Architecture MAM API endpoints: `jsonLoad.php` (profile: ratio, class, BP), `loadSearchJSONbasic.php` (freeleech search), `bonusBuy.php` (50 GiB min tier for API), `download.php` (torrent file). CronJobs: freeleech-grabber (ratio-guarded, every 30 min) adds torrents to qBittorrent (mam-farming); farming-janitor (H&R-aware, every 15 min) deletes them by rule; bp-spender (tier-aware, every 6 hours) buys credit. Metrics pushed to Pushgateway (`mam_ratio`, `mam_class_code`, `mam_bp_balance`, `mam_farming_*`, `mam_janitor_*`) feed the Grafana panels + 5 alerts. ### Key decisions - Ratio guard on grabber — refuse to grab if ratio < 1.2 OR class == Mouse. Prevents the death spiral from deepening. Emits `mam_grabber_skipped_reason{reason=...}` and exits clean. - Demand-first selection — new score formula `leechers*3 - seeders*0.5 + 200 if freeleech_wedge else 0`; size band 50 MiB – 1 GiB; leecher floor 1; seeder ceiling 50. Picks titles that will actually upload. - Janitor decoupled from grabber — runs every 15 min regardless of the ratio-guard state. Without this, stuck torrents accumulate fastest exactly when the grabber is skipping (Mouse class). H&R-aware: never deletes `progress==1.0 AND seeding_time < 72h`. Six delete reasons observable via `mam_janitor_deleted_per_run{reason=...}`. - BP-spender tier-aware — MAM imposes a hard 50 GiB minimum on API buyers ("Automated spenders are limited to buying at least 50 GB... due to log spam"). Valid API tiers: 50/100/200/500 GiB at 500 BP/GiB. The spender picks the smallest tier that satisfies the ratio deficit AND fits the budget, preserving a 500 BP reserve. If even the 50 GiB tier is too expensive, it skips and retries on the next 6-hour cron. 
- Authoritative metrics use MAM profile fields — `downloaded_bytes` / `uploaded_bytes` (integers) rather than the pretty-printed `downloaded` / `uploaded` strings like "715.55 MiB" that MAM also returns. - Ratio-monitor category-first labelling — `tracker` is empty for queued torrents that never announced. Now maps `category==mam-farming` to label `mam` first, only falls back to tracker-URL parsing when category is absent. Stops hundreds of MAM torrents collecting under `unknown`. - qBittorrent resources bumped to `requests=512Mi / limits=1Gi` so hundreds of active torrents don't OOM. ### Emergency recovery performed this session 1. Adopted 5 in-cluster resources via root-module `import {}` blocks (Terraform 1.5+ rejects imports inside child modules). 2. Ran the janitor in DRY_RUN=1 to verify rules against live state — 466 `never_started` candidates, 0 false positives in any other reason bucket. Flipped to enforce mode. 3. Janitor deleted 466 stuck torrents (matches plan's ~495 target; 35 preserved as active/in-progress). 4. Truncated `/data/grabbed_ids.txt` so newly-popular titles become eligible again. The ratio is still 0 because the API cannot buy below 50 GiB and the account sits at 24 551 BP (needs 25 000). Manual 1 GiB purchase via the MAM web UI — 500 BP — would immediately lift the account to ratio ≈ 1.4 and unblock announces. Future automation cannot do this for us due to MAM's anti-spam rule. ### What is NOT in this change - qBittorrent prefs reconciliation (max_active_downloads=20, max_active_uploads=150, max_active_torrents=150). The plan wanted this; deferred to a follow-up because the janitor + ratio recovery handles the 500-torrent backlog first. A small reconciler CronJob posting to /api/v2/app/setPreferences is the intended follow-up. - VIP purchase (~100 k BP) — deferred until BP accumulates. - Cross-seed / autobrr — separate initiative. 
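The demand-first score and the tier-aware purchase rule from the key decisions can be sketched as below. This is a hedged illustration, not the shipped scripts. The score is read as `leechers*3 - seeders*0.5` plus a 200-point bonus when `freeleech_wedge` is set, a reading that reproduces the scores logged in the follow-up tuning commit (S=18, L=1 gives 194). Subtracting the 500 BP reserve before dividing is an assumption chosen to match the logged `affordable=48` at 24551 BP. The function names are hypothetical.

```python
# Hypothetical sketch; constants are taken from the commit message.
API_TIERS_GIB = (50, 100, 200, 500)  # valid API purchase sizes
BP_PER_GIB = 500
BP_RESERVE = 500


def score(leechers, seeders, freeleech_wedge):
    """Demand-first score: reward leechers, penalise seeders, bonus for wedge."""
    return leechers * 3 - seeders * 0.5 + (200 if freeleech_wedge else 0)


def affordable_gib(bp_balance):
    """Whole GiB purchasable while keeping the reserve untouched."""
    return max(0, (bp_balance - BP_RESERVE) // BP_PER_GIB)


def pick_tier(needed_gib, bp_balance):
    """Smallest API tier that covers the deficit and fits the budget.

    Returns 0 when no tier qualifies, meaning: skip and retry on the
    next scheduled run.
    """
    budget = affordable_gib(bp_balance)
    for tier in API_TIERS_GIB:
        if needed_gib <= tier <= budget:
            return tier
    return 0
```

For the account state quoted in the commit (BP=24551, needed=3), `pick_tier` returns 0 because the 48 GiB that is affordable falls short of the 50 GiB minimum tier.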
## Alerts added - P1 MAMMouseClass — `mam_class_code == 0` for 1h - P1 MAMCookieExpired — `mam_farming_cookie_expired > 0` - P2 MAMRatioBelowOne — `mam_ratio < 1.0` for 24h (replaces old QBittorrentMAMRatioLow, now driven by authoritative profile metric) - P2 MAMFarmingStuck — no grabs in 4h while ratio is healthy - P2 MAMJanitorStuckBacklog — `skipped_active > 400` for 6h ## Test plan ### Automated $ cd infra/stacks/servarr && ../../scripts/tg plan 2>&1 | grep Plan Plan: 5 to import, 2 to add, 6 to change, 0 to destroy. $ ../../scripts/tg apply --non-interactive Apply complete! Resources: 5 imported, 2 added, 6 changed, 0 destroyed. # Re-plan after import block removal (idempotent) $ ../../scripts/tg plan 2>&1 | grep Plan Plan: 0 to add, 1 to change, 0 to destroy. # The 1 change is a pre-existing MetalLB annotation drift on the # qbittorrent-torrenting Service — unrelated to this change. $ cd ../monitoring && ../../scripts/tg apply --non-interactive Apply complete! Resources: 0 added, 2 changed, 0 destroyed. # Python + JSON syntax $ python3 -c 'import ast; [ast.parse(open(p).read()) for p in [ "infra/stacks/servarr/mam-farming/files/freeleech-grabber.py", "infra/stacks/servarr/mam-farming/files/bp-spender.py", "infra/stacks/servarr/mam-farming/files/mam-farming-janitor.py"]]' $ python3 -c 'import json; json.load(open( "infra/stacks/monitoring/modules/monitoring/dashboards/qbittorrent.json"))' ### Manual Verification 1. Grabber ratio-guard path: $ kubectl -n servarr create job --from=cronjob/mam-freeleech-grabber g1 $ kubectl -n servarr logs job/g1 Skip grab: ratio=0.0 class=Mouse (floor=1.2) reason=mouse_class 2. BP-spender tier path: $ kubectl -n servarr create job --from=cronjob/mam-bp-spender s1 $ kubectl -n servarr logs job/s1 Profile: ratio=0.0 class=Mouse DL=0.70 GiB UL=0.00 GiB BP=24551 | deficit=1.40 GiB needed=3 affordable=48 buy=0 Done: BP=24551, spent=0 GiB (needed=3, affordable=48) Correctly skips because affordable (48) < smallest API tier (50). 3. 
Janitor in enforce mode: $ kubectl -n servarr create job --from=cronjob/mam-farming-janitor j1 $ kubectl -n servarr logs job/j1 | tail -3 Done: deleted=466 preserved_hnr=0 skipped_active=35 dry_run=False per reason: {'never_started': 466, ...} Second run immediately after: `deleted=0 skipped_active=35` — steady state with only active/seeding torrents left. 4. Alerts loaded: $ kubectl -n monitoring get cm prometheus-server \ -o jsonpath='{.data.alerting_rules\.yml}' \ | grep -E "alert: MAM|alert: QBittorrent" - alert: MAMMouseClass - alert: MAMCookieExpired - alert: MAMRatioBelowOne - alert: MAMFarmingStuck - alert: MAMJanitorStuckBacklog - alert: QBittorrentDisconnected - alert: QBittorrentMAMUnsatisfied 5. Dashboard: browse to Grafana "qBittorrent - Seeding & Ratio" → new "MAM Profile (from jsonLoad.php)" row at the bottom shows class, BP balance, profile ratio, transfer, BP-vs-reserve timeseries, janitor deletion stacked chart, janitor state stat, grabber state stat. ## Reproduce locally 1. `cd infra/stacks/servarr && ../../scripts/tg plan` — expect 0 add / 1 change (unrelated MetalLB annotation drift). 2. `kubectl -n servarr get cronjobs` — expect three: mam-freeleech-grabber, mam-bp-spender, mam-farming-janitor. 3. Trigger each via `kubectl create job --from=cronjob/<name> <job>` and read logs; outputs match the manual-verification snippets above. Closes: code-qfs Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 11:45:38 +00:00
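Two rules from the commit above lend themselves to a short sketch: the janitor's H&R guard (never delete `progress==1.0 AND seeding_time < 72h`) and the ratio-monitor's category-first labelling. The code below is illustrative only. The function names, the use of seconds for seeding time, and the hostname-based fallback are assumptions, since the commit does not show how the tracker URL is parsed.

```python
from urllib.parse import urlparse

HNR_WINDOW_SECONDS = 72 * 3600  # 72h seeding requirement


def is_hnr_protected(progress, seeding_time_s):
    """True for a completed torrent that has not yet seeded for 72h.

    Such torrents must never be deleted, whatever other rule matches.
    """
    return progress == 1.0 and seeding_time_s < HNR_WINDOW_SECONDS


def tracker_label(category, tracker_url):
    """Category-first labelling for the ratio monitor.

    Queued torrents that never announced have an empty tracker field,
    so the category is checked before any URL parsing.
    """
    if category == "mam-farming":
        return "mam"
    # Assumed fallback: use the tracker hostname, else "unknown".
    host = urlparse(tracker_url).hostname if tracker_url else None
    return host or "unknown"
```

Checking the category first is what keeps queued MAM torrents out of the `unknown` bucket even though their tracker field is empty.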
Viktor Barzin	327ce215b9	[infra] Sweep dns_config ignore_changes across all pod-owning resources [ci skip] ## Context Wave 3A (commit `c9d221d5`) added the `# KYVERNO_LIFECYCLE_V1` marker to the 27 pre-existing `ignore_changes = [...dns_config]` sites so they could be grepped and audited. It did NOT address pod-owning resources that were simply missing the suppression entirely. Post-Wave-3A sampling (2026-04-18) found that navidrome, f1-stream, frigate, servarr, monitoring, crowdsec, and many other stacks showed perpetual `dns_config` drift every plan because their `kubernetes_deployment` / `kubernetes_stateful_set` / `kubernetes_cron_job_v1` resources had no `lifecycle {}` block at all. Root cause (same as Wave 3A): Kyverno's admission webhook stamps `dns_config { option { name = "ndots"; value = "2" } }` on every pod's `spec.template.spec.dns_config` to prevent NxDomain search-domain flooding (see `k8s-ndots-search-domain-nxdomain-flood` skill). Without `ignore_changes` on every Terraform-managed pod-owner, Terraform repeatedly tries to strip the injected field. ## This change Extends the Wave 3A convention by sweeping EVERY `kubernetes_deployment`, `kubernetes_stateful_set`, `kubernetes_daemon_set`, `kubernetes_cron_job_v1`, `kubernetes_job_v1` (+ their `_v1` variants) in the repo and ensuring each carries the right `ignore_changes` path: - kubernetes_deployment / stateful_set / daemon_set / job_v1: `spec[0].template[0].spec[0].dns_config` - kubernetes_cron_job_v1: `spec[0].job_template[0].spec[0].template[0].spec[0].dns_config` (extra `job_template[0]` nesting — the CronJob's PodTemplateSpec is one level deeper) Each injection / extension is tagged `# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2` inline so the suppression is discoverable via `rg 'KYVERNO_LIFECYCLE_V1' stacks/`. Two insertion paths are handled by a Python pass (`/tmp/add_dns_config_ignore.py`): 1. 
No existing `lifecycle {}`: inject a brand-new block just before the resource's closing `}`. 108 new blocks on 93 files. 2. Existing `lifecycle {}` (usually for `DRIFT_WORKAROUND: CI owns image tag` from Wave 4, commit a62b43d1): extend its `ignore_changes` list with the dns_config path. Handles both inline (`= [x]`) and multiline (`= [\n x,\n]`) forms; ensures the last pre-existing list item carries a trailing comma so the extended list is valid HCL. 34 extensions. The script skips anything already mentioning `dns_config` inside an `ignore_changes`, so re-running is a no-op. ## Scale - 142 total lifecycle injections/extensions - 93 `.tf` files touched - 108 brand-new `lifecycle {}` blocks + 34 extensions of existing ones - Every Tier 0 and Tier 1 stack with a pod-owning resource is covered - Together with Wave 3A's 27 pre-existing markers → 169 greppable `KYVERNO_LIFECYCLE_V1` dns_config sites across the repo ## What is NOT in this change - `stacks/trading-bot/main.tf` — entirely commented-out block (`/* … */`). Python script touched the file, reverted manually. - `_template/main.tf.example` skeleton — kept minimal on purpose; any future stack created from it should either inherit the Wave 3A one-line form or add its own on first `kubernetes_deployment`. - `terraform fmt` fixes to pre-existing alignment issues in meshcentral, nvidia/modules/nvidia, vault — unrelated to this commit. Left for a separate fmt-only pass. - Non-pod resources (`kubernetes_service`, `kubernetes_secret`, `kubernetes_manifest`, etc.) — they don't own pods so they don't get Kyverno dns_config mutation. ## Verification Random sample post-commit: ``` $ cd stacks/navidrome && ../../scripts/tg plan → No changes. $ cd stacks/f1-stream && ../../scripts/tg plan → No changes. $ cd stacks/frigate && ../../scripts/tg plan → No changes. $ rg -c 'KYVERNO_LIFECYCLE_V1' stacks/ --include='*.tf' --include='*.tf.example' \ | awk -F: '{s+=$2} END {print s}' 169 ``` ## Reproduce locally 1. `git pull` 2. 
`rg 'KYVERNO_LIFECYCLE_V1' stacks/ | wc -l` → 169+ 3. `cd stacks/navidrome && ../../scripts/tg plan` → expect 0 drift on the deployment's dns_config field. Refs: code-seq (Wave 3B dns_config class closed; kubernetes_manifest annotation class handled separately in `8d94688d` for tls_secret) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 21:19:48 +00:00
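The path selection and the inline-list extension described in the commit above might look roughly like this. It is a sketch under assumptions, not the `/tmp/add_dns_config_ignore.py` script itself: the function names and the regular expression are hypothetical, and only the inline `= [x]` form is handled (the multiline form is omitted for brevity).

```python
import re

POD_PATH = "spec[0].template[0].spec[0].dns_config"
CRON_PATH = "spec[0].job_template[0].spec[0].template[0].spec[0].dns_config"


def dns_config_path(resource_type):
    """Return the ignore_changes path for a pod-owning resource type.

    CronJobs nest the pod template one level deeper (job_template).
    """
    if resource_type.startswith("kubernetes_cron_job"):
        return CRON_PATH
    return POD_PATH


def extend_inline_ignore(line, path):
    """Append `path` to an inline `ignore_changes = [...]` line.

    Lines that already mention dns_config, or that are not an inline
    ignore_changes list, are returned unchanged, so re-running is a no-op.
    """
    if "dns_config" in line:
        return line
    match = re.match(r"^(\s*ignore_changes\s*=\s*\[)(.*)(\]\s*)$", line)
    if not match:
        return line
    items = match.group(2).strip().rstrip(",")
    joined = f"{items}, {path}" if items else path
    return f"{match.group(1)}{joined}{match.group(3)}"
```

For example, `ignore_changes = [spec[0].replicas]` on a `kubernetes_deployment` becomes `ignore_changes = [spec[0].replicas, spec[0].template[0].spec[0].dns_config]`.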
Viktor Barzin	8b43692af0	[infra] Suppress Goldilocks vpa-update-mode label drift on all namespaces [ci skip] ## Context Wave 3B-continued: the Goldilocks VPA dashboard (stacks/vpa) runs a Kyverno ClusterPolicy `goldilocks-vpa-auto-mode` that mutates every namespace with `metadata.labels["goldilocks.fairwinds.com/vpa-update-mode"] = "off"`. This is intentional — Terraform owns container resource limits, and Goldilocks should only provide recommendations, never auto-update. The label is how Goldilocks decides per-namespace whether to run its VPA in `off` mode. Effect on Terraform: every `kubernetes_namespace` resource shows the label as pending-removal (`-> null`) on every `scripts/tg plan`. Dawarich survey 2026-04-18 confirmed the drift. Cluster-side count: 88 namespaces carry the label (`kubectl get ns -o json | jq ... | wc -l`). Every TF-managed namespace is affected. This commit brings the intentional admission drift under the same `# KYVERNO_LIFECYCLE_V1` discoverability marker introduced in `c9d221d5` for the ndots dns_config pattern. The marker now stands generically for any Kyverno admission-webhook drift suppression; the inline comment records which specific policy stamps which specific field so future grep audits show why each suppression exists. ## This change 107 `.tf` files touched — every stack's `resource "kubernetes_namespace"` resource gets: ```hcl lifecycle { # KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]] } ``` Injection was done with a brace-depth-tracking Python pass (`/tmp/add_goldilocks_ignore.py`): match `^resource "kubernetes_namespace" ` → track `{` / `}` until the outermost closing brace → insert the lifecycle block before the closing brace. The script is idempotent (skips any file that already mentions `goldilocks.fairwinds.com/vpa-update-mode`) so re-running is safe. 
Vault stack picked up 2 namespaces in the same file (k8s-users produces one, plus a second explicit ns) — confirmed via file diff (+8 lines). ## What is NOT in this change - `stacks/trading-bot/main.tf` — entire file is `/* … */` commented out (paused 2026-04-06 per user decision). Reverted after the script ran. - `stacks/_template/main.tf.example` — per-stack skeleton, intentionally minimal. User keeps it that way. Not touched by the script (file has no real `resource "kubernetes_namespace"` — only a placeholder comment). - `.terraform/` copies (e.g. `stacks/metallb/.terraform/modules/...`) — gitignored, won't commit; the live path was edited. - `terraform fmt` cleanup of adjacent pre-existing alignment issues in authentik, freedify, hermes-agent, nvidia, vault, meshcentral. Reverted to keep the commit scoped to the Goldilocks sweep. Those files will need a separate fmt-only commit or will be cleaned up on next real apply to that stack. ## Verification Dawarich (one of the hundred-plus touched stacks) showed the pattern before and after: ``` $ cd stacks/dawarich && ../../scripts/tg plan Before: Plan: 0 to add, 2 to change, 0 to destroy. # kubernetes_namespace.dawarich will be updated in-place (goldilocks.fairwinds.com/vpa-update-mode -> null) # module.tls_secret.kubernetes_secret.tls_secret will be updated in-place (Kyverno generate.* labels — fixed in `8d94688d`) After: No changes. Your infrastructure matches the configuration. ``` Injection count check: ``` $ rg -c 'KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode' stacks/ | awk -F: '{s+=$2} END {print s}' 108 ``` ## Reproduce locally 1. `git pull` 2. Pick any stack: `cd stacks/<name> && ../../scripts/tg plan` 3. Expect: no drift on the namespace's goldilocks.fairwinds.com/vpa-update-mode label. Closes: code-dwx Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 21:15:27 +00:00
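The brace-depth-tracking pass described in the commit above can be sketched as follows. This is not the actual `/tmp/add_goldilocks_ignore.py`; it is a simplified illustration that assumes the resource's closing brace sits on its own line and that no braces appear inside strings or comments. The function name `inject_lifecycle` is hypothetical.

```python
MARKER = "goldilocks.fairwinds.com/vpa-update-mode"
LIFECYCLE_BLOCK = [
    "  lifecycle {",
    "    # KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy"
    " stamps this label on every namespace",
    '    ignore_changes = [metadata[0].labels["' + MARKER + '"]]',
    "  }",
]


def inject_lifecycle(text):
    """Insert the lifecycle block before each namespace resource's closing brace.

    Idempotent: a file that already mentions the label is returned as-is.
    """
    if MARKER in text:
        return text
    out = []
    inside = False
    depth = 0
    for line in text.splitlines():
        if not inside and line.startswith('resource "kubernetes_namespace" '):
            inside = True
            depth = 0
        if inside:
            new_depth = depth + line.count("{") - line.count("}")
            if new_depth == 0 and depth > 0:
                # Outermost closing brace reached: emit the block first.
                out.extend(LIFECYCLE_BLOCK)
                inside = False
            depth = new_depth
        out.append(line)
    return "\n".join(out) + "\n"
```

Running the function twice over the same text yields the same result as running it once, which mirrors the re-run safety the commit relies on.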
Viktor Barzin	e80b2f026f	[infra] Migrate Terraform state from local SOPS to PostgreSQL backend Two-tier state architecture: - Tier 0 (infra, platform, cnpg, vault, dbaas, external-secrets): local state with SOPS encryption in git — unchanged, required for bootstrap. - Tier 1 (105 app stacks): PostgreSQL backend on CNPG cluster at 10.0.20.200:5432/terraform_state with native pg_advisory_lock. Motivation: multi-operator friction (every workstation needed SOPS + age + git-crypt), bootstrap complexity for new operators, and headless agents/CI needing the full encryption toolchain just to read state. Changes: - terragrunt.hcl: conditional backend (local vs pg) based on tier0 list - scripts/tg: tier detection, auto-fetch PG creds from Vault for Tier 1, skip SOPS and Vault KV locking for Tier 1 stacks - scripts/state-sync: tier-aware encrypt/decrypt (skips Tier 1) - scripts/migrate-state-to-pg: one-shot migration script (idempotent) - stacks/vault/main.tf: pg-terraform-state static role + K8s auth role for claude-agent namespace - stacks/dbaas: terraform_state DB creation + MetalLB LoadBalancer service on shared IP 10.0.20.200 - Deleted 107 .tfstate.enc files for migrated Tier 1 stacks - Cleaned up per-stack tiers.tf (now generated by root terragrunt.hcl) [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 19:33:12 +00:00
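The tier split above comes down to a membership test against the Tier 0 list. A minimal sketch, assuming the stack is identified by its directory name; the real logic lives in `terragrunt.hcl` and `scripts/tg`, and the function name here is hypothetical.

```python
# Tier 0 stacks keep local SOPS-encrypted state; everything else uses PostgreSQL.
TIER0_STACKS = frozenset(
    {"infra", "platform", "cnpg", "vault", "dbaas", "external-secrets"}
)


def backend_for(stack_name):
    """Return "local" for bootstrap-critical stacks, "pg" for app stacks."""
    return "local" if stack_name in TIER0_STACKS else "pg"
```

Keeping the bootstrap stacks on local state avoids a circular dependency: the PostgreSQL cluster and Vault that serve Tier 1 state are themselves created by Tier 0 stacks.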
Viktor Barzin	216d4240c9	[infra] Add Cloudflare provider to all stack lock files and generated providers Terragrunt now generates cloudflare_provider.tf (Vault-sourced API key) and includes cloudflare in required_providers. These are the generated files from running `terragrunt init -upgrade` across all stacks. [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 16:31:36 +00:00
Viktor Barzin	b1d152be1f	[infra] Auto-create Cloudflare DNS records from ingress_factory ## Context Deploying new services required manually adding hostnames to cloudflare_proxied_names/cloudflare_non_proxied_names in config.tfvars — a separate file from the service stack. This was frequently forgotten, leaving services unreachable externally. ## This change: - Add `dns_type` parameter to `ingress_factory` and `reverse_proxy/factory` modules. Setting `dns_type = "proxied"` or `"non-proxied"` auto-creates the Cloudflare DNS record (CNAME to tunnel or A/AAAA to public IP). - Simplify cloudflared tunnel from 100 per-hostname rules to wildcard `*.viktorbarzin.me → Traefik`. Traefik still handles host-based routing. - Add global Cloudflare provider via terragrunt.hcl (separate cloudflare_provider.tf with Vault-sourced API key). - Migrate 118 hostnames from centralized config.tfvars to per-service dns_type. 17 hostnames remain centrally managed (Helm ingresses, special cases). - Update docs, AGENTS.md, CLAUDE.md, dns.md runbook. ``` BEFORE: config.tfvars (manual list) -> stacks/cloudflared/ (for_each = list) -> cloudflare_record + tunnel per-hostname AFTER: stacks/<svc>/main.tf module "ingress" { dns_type = "proxied" } -> auto-creates cloudflare_record + annotation ``` ## What is NOT in this change: - Uptime Kuma monitor migration (still reads from config.tfvars) - 17 remaining centrally-managed hostnames (Helm, special cases) - Removal of allow_overwrite (keep until migration confirmed stable) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 13:45:04 +00:00
Viktor Barzin	30cdeefb1c	chore: sync terraform state after nfsvers=4 convergence Applied all 20 NFS stacks to converge PV mount_options (nfsvers=4). State files encrypted and committed. [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 11:20:18 +00:00
Viktor Barzin	82b0f6c4cb	truenas deprecation: migrate all non-immich storage to proxmox NFS - Migrate 7 backup CronJobs to Proxmox host NFS (192.168.1.127) (etcd, mysql, postgresql, nextcloud, redis, vaultwarden, plotting-book) - Migrate headscale backup, ebook2audiobook, osm_routing to Proxmox NFS - Migrate servarr (lidarr, readarr, soulseek) NFS refs to Proxmox - Remove 79 orphaned TrueNAS NFS module declarations from 49 stacks - Delete stacks/platform/modules/ (27 dead module copies, 65MB) - Update nfs-truenas StorageClass to point to Proxmox (192.168.1.127) - Remove iscsi DNS record from config.tfvars - Fix woodpecker persistence config and alertmanager PV Only Immich (8 PVCs, ~1.4TB) remains on TrueNAS.	2026-04-12 14:35:39 +01:00
Viktor Barzin	a4c80adbce	fix(prowlarr): correct image tag from 1.31.1 to 2.3.5 [ci skip] LinuxServer.io prowlarr uses different version scheme than the agent guessed. Tag 1.31.1 doesn't exist on lscr.io.	2026-04-06 14:55:33 +03:00
Viktor Barzin	09b4bad958	feat: pin ~28 images to specific versions, enable DIUN monitoring, add app-stacks pipeline Pin third-party images from :latest to current stable versions: - Platform: cloudflared, technitium, snmp-exporter, pve-exporter, headscale, shadowsocks, xray - Apps: paperless-ngx, linkwarden, wealthfolio, speedtest, synapse, n8n, prowlarr, qbittorrent, lidarr, rybbit, ollama, immichframe, cyberchef, networking-toolbox, echo, coturn, shlink, affine Enable DIUN annotations on all pinned deployments with per-image tag patterns. Add Woodpecker app-stacks pipeline for selective terragrunt apply on changed app stacks.	2026-04-06 14:27:13 +03:00
Viktor Barzin	9e25441c30	fix: restore changedetection and flaresolverr services - changedetection: increase memory from 64Mi to 256Mi/512Mi (was OOMKilling), set replicas back to 1 - flaresolverr: re-enable with replicas=1, increase memory limit to 1Gi (needed by book-search for Cloudflare bypass)	2026-04-06 14:26:29 +03:00
Viktor Barzin	cb8a808700	feat(storage): migrate 38 NFS PVCs to proxmox-lvm (Wave 2) Add proxmox-lvm PVCs with pvc-autoresizer annotations for all remaining single-pod app data services. Deployments updated to use new block storage PVCs. Old NFS modules retained for rollback. Services: affine, changedetection, diun, excalidraw, f1-stream, hackmd, isponsorblocktv, matrix, n8n, send, grampsweb, health, onlyoffice, owntracks, paperless-ngx, privatebin, resume, speedtest, stirling-pdf, tandoor, rybbit (clickhouse), tor-proxy (torrserver), whisper+piper, frigate (config), ollama (ui), servarr (prowlarr/listenarr/qbittorrent), aiostreams, freshrss (extensions), meshcentral (data+files), openclaw (data+home+openlobster), technitium, mailserver (data+roundcube html+enigma), dbaas (pgadmin). Strategy set to Recreate where needed for RWO volumes.	2026-04-04 19:25:12 +03:00
Viktor Barzin	95e49134ae	cleanup: remove old audiobook-search, superseded by book-search - Delete servarr/audiobook-search TF module (moved to ebooks/book-search) - Remove audiobook-search from cloudflare_proxied_names - Remove commented-out module reference in servarr/main.tf - Clean up "renamed from" comment in ebooks/main.tf - K8s resources (deploy/svc/ingress) deleted from servarr namespace - Cloudflare DNS record already absent - Import book-search and insta2spotify DNS records into cloudflared state	2026-03-25 23:16:01 +02:00
Viktor Barzin	6e1d8c0c8b	add ebooks stack: consolidate book services into single namespace [ci skip] - New ebooks namespace with CWA, Stacks, Audiobookshelf, book-search - book-search (renamed from audiobook-search) with CWA ingest volume - Comment out audiobook_search module from servarr - All NFS volumes and secrets consolidated	2026-03-25 15:04:27 +02:00
Viktor Barzin	009f4b3b89	change qBittorrent torrent port from 6881 to 50000 Port 6881 is blacklisted by MAM and throttled by ISPs. Also added pfSense NAT rule for 50000 TCP+UDP → 10.0.20.200.	2026-03-25 12:29:00 +02:00
Viktor Barzin	5b5a7d8cb4	add MAM email/password env vars to audiobook-search deployment Reads mam_email and mam_password from Vault secret/servarr via ESO.	2026-03-25 12:03:12 +02:00
Viktor Barzin	c49e4561a3	consolidate MetalLB IPs: 5 → 1 (10.0.20.200) Migrate all 11 LoadBalancer services to share 10.0.20.200: - Update annotations: metallb.universe.tf → metallb.io - Pin all services to 10.0.20.200 with allow-shared-ip: shared - Standardize externalTrafficPolicy to Cluster (required for IP sharing) - Remove redundant port 80 (roundcube) from mailserver LB - Update CoreDNS forward: 10.0.20.204 → 10.0.20.200 - Update cloudflared tunnel target: 10.0.20.202 → 10.0.20.200 Services consolidated: coturn, headscale, kms, qbittorrent, shadowsocks, torrserver, wireguard, mailserver, traefik, xray, technitium	2026-03-24 18:35:43 +02:00
Viktor Barzin	4ca7af8818	add audiobook-search service to servarr stack - New audiobook-search deployment + service + ingress (Authentik-protected) - qBittorrent: add NFS mount for /audiobooks (shared with Audiobookshelf) - Cloudflare DNS: add audiobook-search.viktorbarzin.me - Env vars: QBITTORRENT_URL/PASS, AUDIOBOOKSHELF_URL/TOKEN from ESO	2026-03-24 01:21:49 +02:00
Viktor Barzin	21bb3036af	state(dbaas): update encrypted state	2026-03-19 20:23:59 +00:00
Viktor Barzin	a04335d0f3	right-size 14 services and scale down GPU-heavy workloads [ci skip] Memory right-sizing based on VPA upperBound analysis: - Increases: stirling-pdf 1200→1536Mi, claude-memory 64→128Mi, dawarich 512→768Mi, kyverno-cleanup 128→192Mi, linkwarden 768→1Gi, navidrome 64→128Mi, listenarr 768→896Mi, privatebin 64→128Mi, ntfy 64→128Mi, health 128→256Mi, dbaas quota 16→20Gi, mysql-operator 384→512Mi - Decreases: rybbit 768→384Mi, nvidia-exporter added explicit 192Mi, dcgm-exporter 2560→1536Mi - Scale to 0: ebook2audiobook/audiblez-web, whisper (GPU node pressure) Net effect: -496Mi cluster-wide, 13 ContainerNearOOM alerts resolved, all ResourceQuota pressures cleared, GPU health green.	2026-03-15 23:00:49 +00:00
Viktor Barzin	39b3c51709	migrate 16 plan-time stacks: vault data source → ESO + kubernetes_secret Replaced data "vault_kv_secret_v2" with: 1. ExternalSecret (ESO syncs Vault KV → K8s Secret) 2. data "kubernetes_secret" (reads ESO-created secret at plan time) This removes the Vault provider dependency at plan time for these stacks — they now only need K8s API access, not a Vault token. Stacks: actualbudget, affine, audiobookshelf, calibre, changedetection, coturn, freedify, freshrss, grampsweb, navidrome, novelapp, ollama, owntracks, real-estate-crawler, servarr, ytdlp	2026-03-15 22:06:39 +00:00
Viktor Barzin	3aba29e7a3	remove SOPS pipeline, deploy ESO + Vault DB/K8s engines Vault is now the sole source of truth for secrets. SOPS pipeline removed entirely — auth via `vault login -method=oidc`. Part A: SOPS removal - vault/main.tf: delete 990 lines (93 vars + 43 KV write resources), add self-read data source for OIDC creds from secret/vault - terragrunt.hcl: remove SOPS var loading, vault_root_token, check_secrets hook - scripts/tg: remove SOPS decryption, keep -auto-approve logic - .woodpecker/default.yml: replace SOPS with Vault K8s auth via curl - Delete secrets.sops.json, .sops.yaml Part B: External Secrets Operator - New stack stacks/external-secrets/ with Helm chart + 2 ClusterSecretStores (vault-kv for KV v2, vault-database for DB engine) Part C: Database secrets engine (in vault/main.tf) - MySQL + PostgreSQL connections with static role rotation (24h) - 6 MySQL roles (speedtest, wrongmove, codimd, nextcloud, shlink, grafana) - 6 PostgreSQL roles (trading, health, linkwarden, affine, woodpecker, claude_memory) Part D: Kubernetes secrets engine (in vault/main.tf) - RBAC for Vault SA to manage K8s tokens - Roles: dashboard-admin, ci-deployer, openclaw, local-admin - New scripts/vault-kubeconfig helper for dynamic kubeconfig K8s auth method with scoped policies for CI, ESO, OpenClaw, Woodpecker sync.	2026-03-15 16:37:38 +00:00
Viktor Barzin	23019da8e5	equalize memory req=lim across 70+ containers using Prometheus 7d max data After node2 OOM incident, right-size memory across the cluster by setting requests=limits based on max_over_time(container_memory_working_set_bytes[7d]) with 1.3x headroom. Eliminates ~37Gi overcommit gap. Categories: - Safe equalization (50 containers): set req=lim where max7d well within target - Limit increases (8 containers): raise limits for services spiking above current - No Prometheus data (12 containers): conservatively set lim=req - Exception: nextcloud keeps req=256Mi/lim=8Gi due to Apache memory spikes Also increased dbaas namespace quota from 12Gi to 16Gi to accommodate mysql 4Gi limits across 3 replicas.	2026-03-14 21:46:49 +00:00
Viktor Barzin	a8d944eb9b	migrate all secrets from SOPS to Vault KV - Add vault provider to root terragrunt.hcl (generated providers.tf) - Delete stacks/vault/vault_provider.tf (now in generated providers.tf) - Add 124 variable declarations + 43 vault_kv_secret_v2 resources to vault/main.tf to populate Vault KV at secret/<stack-name> - Migrate 43 consuming stacks to read secrets from Vault KV via data "vault_kv_secret_v2" instead of SOPS var-file - Add dependency "vault" to all migrated stacks' terragrunt.hcl - Complex types (maps/lists) stored as JSON strings, decoded with jsondecode() in locals blocks Bootstrap secrets (vault_root_token, vault_authentik_client_id, vault_authentik_client_secret) remain in SOPS permanently. Apply order: vault stack first (populates KV), then all others.	2026-03-14 17:15:48 +00:00
Viktor Barzin	b00f810d3d	Remove all CPU limits cluster-wide to eliminate CFS throttling CPU limits cause CFS throttling even when nodes have idle capacity. Move to a request-only CPU model: keep CPU requests for scheduling fairness but remove all CPU limits. Memory limits stay (incompressible). Changes across 108 files: - Kyverno LimitRange policy: remove cpu from default/max in all 6 tiers - Kyverno ResourceQuota policy: remove limits.cpu from all 5 tiers - Custom ResourceQuotas: remove limits.cpu from 8 namespace quotas - Custom LimitRanges: remove cpu from default/max (nextcloud, onlyoffice) - RBAC module: remove cpu_limits variable and quota reference - Freedify factory: remove cpu_limit variable and limits reference - 86 deployment files: remove cpu from all limits blocks - 6 Helm values files: remove cpu under limits sections	2026-03-14 08:51:45 +00:00
Viktor Barzin	ce79bd5c04	Add node hang instrumentation and scale down chromium services - Add journald collection to Alloy (loki.source.journal) for kernel OOM, panic, hung task, and soft lockup detection — ships system logs off-node so they survive hard resets - Add 5 Loki alerting rules (KernelOOMKiller, KernelPanic, KernelHungTask, KernelSoftLockup, ContainerdDown) evaluating against node-journal logs - Fix Loki ruler config: correct rules mount path (/var/loki/rules/fake), add alertmanager_url and enable_api - Add Prometheus alerts: NodeMemoryPressureTrending (>85%), NodeExporterDown, NodeHighIOWait (>30%) - Add caretta tolerations for control-plane and GPU nodes - Scale down chromium-based services to 0 for cluster stability: f1-stream, flaresolverr, changedetection, resume/printer	2026-03-13 22:20:28 +00:00
Viktor Barzin	d352d6e7f8	resource quota review: fix OOM risks, close quota gaps, add HA protections Phase 1 - OOM fixes: - dashy: increase memory limit 512Mi→1Gi (was at 99% utilization) - caretta DaemonSet: set explicit resources 300Mi/512Mi (was at 85-98%) - mysql-operator: add Helm resource values 256Mi/512Mi, create namespace with tier label (was at 92% of LimitRange default) - prowlarr, flaresolverr, annas-archive-stacks: add explicit resources (outgrowing 256Mi LimitRange defaults) - real-estate-crawler celery: add resources 512Mi/3Gi (608Mi actual, no explicit resources) Phase 2 - Close quota gaps: - nvidia, real-estate-crawler, trading-bot: remove custom-quota=true labels so Kyverno generates tier-appropriate quotas - descheduler: add tier=1-cluster label for proper classification Phase 3 - Reduce excessive quotas: - monitoring: limits.memory 240Gi→64Gi, limits.cpu 120→64 - woodpecker: limits.memory 128Gi→32Gi, limits.cpu 64→16 - GPU tier default: limits.memory 96Gi→32Gi, limits.cpu 48→16 Phase 4 - Kubelet protection: - Add cpu: 200m to systemReserved and kubeReserved in kubelet template Phase 5 - HA improvements: - cloudflared: add topology spread (ScheduleAnyway) + PDB (maxUnavailable:1) - grafana: add topology spread + PDB via Helm values - crowdsec LAPI: add topology spread + PDB via Helm values - authentik server: add topology spread via Helm values - authentik worker: add topology spread + PDB via Helm values	2026-03-08 18:17:46 +00:00
Viktor Barzin	9d031290cc	[ci skip] fix Homepage icons for Tandoor, Listenarr, Networking Toolbox, Goldilocks tandoor.png → tandoor-recipes.png (dashboard-icons), podcast.png → mdi-podcast, networking.png → mdi-lan, goldilocks.png → mdi-scale-balance	2026-03-07 21:29:51 +00:00
Viktor Barzin	f3042f318e	[ci skip] fix widget issues: ports, Immich v2 API, Nextcloud trusted domains - qBittorrent: use service port 80 (not container port 8080) - Immich: add version=2 for new API endpoints (/api/server/*) - Nextcloud: use external URL (internal rejects untrusted Host header) - HA London: remove widget (token expired, needs manual regeneration) - Headscale: remove widget (requires nodeId param, not overview)	2026-03-07 20:39:56 +00:00
Viktor Barzin	57eed07370	[ci skip] add widgets for qbittorrent, navidrome, nextcloud, freshrss, linkwarden, uptime-kuma Add API credentials to SOPS and wire homepage_credentials through stacks. Re-add Uptime Kuma widget with new "infra" status page slug.	2026-03-07 20:39:55 +00:00
Viktor Barzin	10acdcd5a2	[ci skip] add widgets for audiobookshelf, changedetection, prowlarr, headscale Wire homepage_credentials through servarr parent stack for prowlarr. Fix paperless-ngx widget to use internal service URL.	2026-03-07 20:39:55 +00:00
Viktor Barzin	6bd3970579	[ci skip] add Homepage gethomepage.dev annotations to all services Add Kubernetes ingress annotations for Homepage auto-discovery across ~88 services organized into 11 groups. Enable serviceAccount for RBAC, configure group layouts, and add Grafana/Frigate/Speedtest widgets.	2026-03-07 20:39:54 +00:00
Viktor Barzin	1f2c1ca361	[ci skip] phase 5+6: update CI pipelines for SOPS, add sensitive=true to secret vars Phase 5 — CI pipelines: - default.yml: add SOPS decrypt in prepare step, change git add . to specific paths (stacks/ state/ .woodpecker/), cleanup on success+failure - renew-tls.yml: change git add . to git add secrets/ state/ Phase 6 — sensitive=true: - Add sensitive = true to 256 variable declarations across 149 stack files - Prevents secret values from appearing in terraform plan output - Does NOT modify shared modules (ingress_factory, nfs_volume) to avoid breaking module interface contracts Note: CI pipeline SOPS decryption requires sops_age_key Woodpecker secret to be created before the pipeline will work with SOPS. Until then, the old terraform.tfvars path continues to function.	2026-03-07 14:30:36 +00:00
Viktor Barzin	220aa739ce	[ci skip] migrate servarr sub-stacks + actualbudget factory NFS to CSI PV/PVC Final batch: servarr (aiostreams, listenarr, readarr, soulseek, prowlarr, qbittorrent, lidarr) and actualbudget factory. All use ../../../modules/kubernetes/nfs_volume (3 levels deep).	2026-03-02 02:04:22 +00:00
Viktor Barzin	9e4fb23b10	[ci skip] right-size all pod resources based on VPA + live metrics audit Full cluster resource audit: cross-referenced Goldilocks VPA recommendations, live kubectl top metrics, and Terraform definitions for 100+ containers. Critical fixes: - dashy: CPU throttled at 98% (490m/500m) → 2 CPU limit - stirling-pdf: CPU throttled at 99.7% (299m/300m) → 2 CPU limit - traefik auth-proxy/bot-block-proxy: mem limit 32Mi → 128Mi Added explicit resources to ~40 containers that had none: - audiobookshelf, changedetection, cyberchef, dawarich, diun, echo, excalidraw, freshrss, hackmd, isponsorblocktv, linkwarden, n8n, navidrome, ntfy, owntracks, privatebin, send, shadowsocks, tandoor, tor-proxy, wealthfolio, networking-toolbox, rybbit, mailserver, cloudflared, pgadmin, phpmyadmin, crowdsec-web, xray, wireguard, k8s-portal, tuya-bridge, ollama-ui, whisper, piper, immich-server, immich-postgresql, osrm-foot GPU containers: added CPU/mem alongside GPU limits: - ollama: removed CPU/mem limits (models vary in size), keep GPU only - frigate: req 500m/2Gi, lim 4/8Gi + GPU - immich-ml: req 100m/1Gi, lim 2/4Gi + GPU Right-sized ~25 over-provisioned containers: - kms-web-page: 500m/512Mi → 50m/64Mi (was using 0m/10Mi) - onlyoffice: CPU 8 → 2 (VPA upper 45m) - realestate-crawler-api: CPU 2000m → 250m - blog/travel-blog/webhook-handler: 500m → 100m - coturn/health/plotting-book: reduced to match actual usage Conservative methodology: limits = max(VPA upper * 2, live usage * 2)	2026-03-01 19:18:50 +00:00
Viktor Barzin	a1ba218cd2	[ci skip] Phase 1: PostgreSQL migrated to CNPG on local disk Major milestone - shared PostgreSQL moved from NFS to CloudNativePG: - CNPG cluster (pg-cluster) running in dbaas namespace on local-path storage - PostGIS image (ghcr.io/cloudnative-pg/postgis:16) for dawarich compatibility - All 20 databases and 19 roles restored from pg_dumpall backup - postgresql.dbaas Service patched to point at CNPG primary - Old PG deployment scaled to 0 (NFS data intact for rollback) - All 12+ dependent services verified running: authentik, n8n, dawarich, tandoor, linkwarden, netbox, woodpecker, rybbit, affine, health, resume, trading-bot, atuin - Authentik PgBouncer working through the switched endpoint TODO: codify CNPG cluster in Terraform, add 2nd replica, update backup CronJob	2026-02-28 19:08:06 +00:00
Viktor Barzin	eb32190461	[ci skip] fix OOM crashes: add resource limits for osrm-bicycle, aiostreams, listenarr, authentik - osrm-bicycle: 1Gi limit (loads 403MB routing graph) - aiostreams: 768Mi limit (loads 44K anime entries) - listenarr: 1Gi limit (.NET + Playwright/Chromium) - authentik server: 1Gi limit, worker: 1Gi limit (Django + gunicorn) - servarr: pass nfs_server variable to all submodules	2026-02-28 17:03:33 +00:00
Viktor Barzin	89a6e08245	[ci skip] Infrastructure hardening: security, monitoring, reliability, maintainability Phase 1 - Critical Security: - Netbox: move hardcoded DB/superuser passwords to variables - MeshCentral: disable public registration, add Authentik auth - Traefik: disable insecure API dashboard (api.insecure=false) - Traefik: configure forwarded headers with Cloudflare trusted IPs Phase 2 - Security Hardening: - Add security headers middleware (HSTS, X-Frame-Options, nosniff, etc.) - Add Kyverno pod security policies in audit mode (privileged, host namespaces, SYS_ADMIN, trusted registries) - Tighten rate limiting (avg=10, burst=50) - Add Authentik protection to grampsweb Phase 3 - Monitoring & Alerting: - Add critical service alerts (PostgreSQL, MySQL, Redis, Headscale, Authentik, Loki) - Increase Loki retention from 7 to 30 days (720h) - Add predictive PV filling alert (predict_linear) - Re-enable Hackmd and Privatebin down alerts Phase 4 - Reliability: - Add resource requests/limits to Redis, DBaaS, Technitium, Headscale, Vaultwarden, Uptime Kuma - Increase Alloy DaemonSet memory to 512Mi/1Gi Phase 6 - Maintainability: - Extract duplicated tiers locals to terragrunt.hcl generate block (removed from 67 stacks) - Replace hardcoded NFS IP 10.0.10.15 with var.nfs_server (114 instances across 63 files) - Replace hardcoded Redis/PostgreSQL/MySQL/Ollama/mail host references with variables across ~35 stacks - Migrate xray raw ingress resources to ingress_factory modules	2026-02-23 22:05:28 +00:00
Viktor Barzin	c7c7047f1c	[ci skip] Flatten module wrappers into stack roots Remove the module "xxx" { source = "./module" } indirection layer from all 66 service stacks. Resources are now defined directly in each stack's main.tf instead of through a wrapper module. - Merge module/main.tf contents into stack main.tf - Apply variable replacements (var.tier -> local.tiers.X, renamed vars) - Fix shared module paths (one fewer ../ at each level) - Move extra files/dirs (factory/, chart_values, subdirs) to stack root - Update state files to strip module.<name>. prefix - Update CLAUDE.md to reflect flat structure Verified: terragrunt plan shows 0 add, 0 destroy across all stacks.	2026-02-22 15:13:55 +00:00
Viktor Barzin	e6420c7b36	[ci skip] Move Terraform modules into stack directories Move all 88 service modules (66 individual + 22 platform) from modules/kubernetes/<service>/ into their corresponding stack directories: - Service stacks: stacks/<service>/module/ - Platform stack: stacks/platform/modules/<service>/ This collocates module source code with its Terragrunt definition. Only shared utility modules remain in modules/kubernetes/: ingress_factory, setup_tls_secret, dockerhub_secret, oauth-proxy. All cross-references to shared modules updated to use correct relative paths. Verified with terragrunt run --all -- plan: 0 adds, 0 destroys across all 68 stacks.	2026-02-22 14:38:14 +00:00
Viktor Barzin	a9ba8899be	[ci skip] Phase 3: Create 66 service stacks and migrate state Generated individual stack directories for all 66 services under stacks/. Each stack has terragrunt.hcl (depends on platform) and main.tf (thin wrapper calling existing module). Migrated all 64 active service states from root terraform.tfstate to individual state files. Root state is now empty. Verified with terragrunt plan on multiple stacks (no changes).	2026-02-22 13:56:34 +00:00

44 commits
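
Two of the commits above state memory-sizing rules in words: 23019da8e5 sets requests equal to limits from `max_over_time(container_memory_working_set_bytes[7d])` with 1.3x headroom, and 9e4fb23b10 uses `limits = max(VPA upper * 2, live usage * 2)`. The sketch below only restates that arithmetic. The function names, the MiB units, and the rounding up to a 64 MiB step are assumptions for illustration and are not taken from the repository.

```python
import math


def headroom_limit_mib(max_7d_mib: float, headroom: float = 1.3, step_mib: int = 64) -> int:
    """Req=lim value from a 7-day peak working set (rule quoted in 23019da8e5).

    The 1.3x headroom comes from the commit message; rounding up to a
    multiple of step_mib is a hypothetical convenience, not the author's rule.
    """
    return math.ceil(max_7d_mib * headroom / step_mib) * step_mib


def conservative_limit_mib(vpa_upper_mib: float, live_usage_mib: float) -> float:
    """Limit per the rule quoted in 9e4fb23b10: max(VPA upper * 2, live usage * 2)."""
    return max(vpa_upper_mib * 2, live_usage_mib * 2)


if __name__ == "__main__":
    # Hypothetical inputs, chosen only to exercise the two formulas.
    print(headroom_limit_mib(608))
    print(conservative_limit_mib(45, 10))
```

Both rules size from an observed peak rather than an average, which fits the stated goal of avoiding OOM kills while closing the overcommit gap.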