infra

Author	SHA1	Message	Date
Viktor Barzin	cdb7d9a81a	keel: sweep KEEL_LIFECYCLE_V1 + per-container KEEL_IGNORE_IMAGE across enrolled workloads Every Keel-enrolled workload (policy=patch, match-tag=true, injected by the inject-keel-annotations Kyverno policy) was fighting Terraform: Keel rewrites the image tag and restamps keel.sh/update-time, change-cause and the rollout revision on each poll; without ignore_changes every `tg apply` reverted those — downgrading the image and forcing a spurious rollout that Keel then re-did. Only llama-cpp had the full block (added 2026-05-24); the other ~73 workloads drifted. This sweep adds, to every enrolled deployment/daemonset lifecycle: - container[N].image (one per container index + init_container[N]) # KEEL_IGNORE_IMAGE - keel.sh/match-tag, keel.sh/update-time, kubernetes.io/change-cause, deployment.kubernetes.io/revision # KEEL_LIFECYCLE_V1 Verified via `tg plan` on speedtest (single-container: image downgrade 0.24.3->0.24.1 + annotation strip now gone) and changedetection (multi-container: both container images no longer drift). AGENTS.md drift-suppression section updated with the canonical block + marker legend. fire-planner deferred (parallel session mid-apply per presence board). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 23:09:30 +00:00
Viktor Barzin	60b2b1cdfc	cluster-health: emergency-stop Keel + roll back image downgrades + quota raises Keel was rewriting tag strings (not just digests) despite the keel.sh/match-tag=true annotation injected by the Kyverno inject-keel-annotations ClusterPolicy. That annotation was supposed to constrain Keel to digest-only watches under the deployment's CURRENT tag. It didn't. Casualties confirmed today (live image rewritten to a lower version): uptime-kuma (:2 → :1, 4h CrashLoopBackOff because v1 boots into SQLite mode and can't read the v2 db-config.json → MariaDB store); n8n (:1.80.5 → :0.1.2, silent — EEXIST mkdir /root/.n8n loop); beads-server/dolt-workbench (:0.3.73 → :0.1.0, GraphQL schema mismatch on addDatabaseConnection); wealthfolio (:3.2.1 → :2.0 → :3.2 string truncate); plus historical ones previously fixed (claude-memory :71b32438 → :17, forgejo 11.0.14 → 1.18, onlyoffice 9.3.1 → 4.0.0.9, shlink 5.0.2 → 1.16.1). Changes: * stacks/keel: replicaCount = 0 in the helm values. Pod went from 1/1 to 0/0. Keep off until either match-tag is root-caused or every enrolled workload migrates to a content-addressed (SHA) pin. * stacks/uptime-kuma: pin image to louislam/uptime-kuma:2.3.2 (was :2, bumped to :1 by Keel). Full opt-out: keel.sh/policy=never on BOTH the deployment label (matches Kyverno's exclude rule so the inject-keel- annotations ClusterPolicy stops mutating) AND the annotation (so Keel itself respects). Removed keel.sh/policy from lifecycle.ignore_changes so TF owns it as `never` and can't drift back to `force`. * stacks/beads-server: pin dolt-workbench to dolthub/dolt-workbench:0.3.73 on both seed-config and workbench containers (was :latest, Keel rolled to :0.1.0). * stacks/wealthfolio: pin to afadil/wealthfolio:3.2.1 (was :3.2 truncated by Keel from the prior live :3.2.1). * stacks/monitoring: monitoring-quota requests.memory 16Gi → 20Gi. Cluster grew from 5 to 7 workers (k8s-node5/6 added 2026-05-26) and alloy's per-pod request jumped 50Mi → 562Mi earlier today; combined with new-node DS pods (loki-canary, node-exporter, sysctl-inotify) the quota tipped to 100% and blocked every new pod create with FailedCreate. Raising the cap unblocked the four affected DaemonSets in one shot. * stacks/immich: tier-quota requests.memory 20Gi → 24Gi, limits.memory 32Gi → 40Gi. Was at 88% with VPA still creeping up on immich-server's face-detection burst behaviour. * stacks/{excalidraw,immich,n8n}: providers.tf + .terraform.lock.hcl updated by `tg init -upgrade` to record telmate/proxmox 3.0.2-rc07 (matches the 21 other stacks that already declare it). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 18:48:50 +00:00
Viktor Barzin	9277d71d81	nfs-mirror: append transferred files to offsite-sync manifest Some checks failed ci/woodpecker/push/default Pipeline is running Details ci/woodpecker/push/build-cli Pipeline failed Details Step 1 of offsite-sync-backup is incremental on non-monthly days, driven by /mnt/backup/.changed-files which only daily-backup wrote to. nfs-mirror's writes were therefore invisible to Step 1 until the next monthly --delete pass — which would also wipe data pre-positioned on Synology pve-backup/ (e.g. the in-place btrfs rename we just did to relocate ~160G of NFS subtrees from /Backup/Viki/nfs/<svc>/ to /Backup/Viki/pve-backup/<svc>/). Fix: snapshot a timestamp before rsync, then after rsync use `find -newer $STAMP -type f -printf '%P\n'` to enumerate every file nfs-mirror created/modified and append to the manifest. Paths are relative to /mnt/backup/ (matches Step 1 --files-from expectation). State files are excluded. The current in-flight first run started before this patch was deployed, so its writes won't auto-populate the manifest — a one-off manual backfill will be done after it completes.	2026-05-24 15:32:22 +00:00
Viktor Barzin	d76f4c4827	n8n: cut memory request 1Gi → 512Mi (+ image bump 1.80.0 → 1.80.5) Right-sizing per krr report (2026-05-22). Image bump syncs main.tf with the live Keel-managed version to avoid an inadvertent downgrade on apply. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-23 10:03:28 +00:00
Viktor Barzin	7bcd3f745c	ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698 )	2026-05-16 13:47:13 +00:00
Viktor Barzin	70d0623b21	ci: retrigger apply for pending Keel enrollment (~58 stacks) Bulk enrollment commit `8f4b1956` had its CI pipeline #689 killed before terragrunt apply ran. The enrollment label + V2 lifecycle changes are in master but never reached the cluster. Appending a one-line marker to each pending stack's main.tf so Woodpecker's diff-detection picks them up and applies them serially. Idempotent — re-applying a stack whose state already matches is a no-op. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-16 13:42:57 +00:00
Viktor Barzin	8f4b19565c	recruiter-responder: bump image_tag to 189ef901 OpenClaw can now answer 'what do we know about <company>?' from cache via the new recruiter_company_research tool, and recruiter_get embeds the cached research payload inline. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-16 12:41:05 +00:00
Viktor Barzin	53657d9952	infra: document auth = "app\|none" tier on every legacy ingress Sweep through the 30+ stacks that predated the auth = "app" tier and were tagged auth = "none" without a comment explaining why they weren't behind Authentik. Each is now self-documenting at the call site, so the tg-level anti-exposure guard passes and future readers don't have to reverse-engineer the intent. Flipped 6 stacks from "none" to "app" — their backends have their own user auth and the new tier records that more accurately: - navidrome (Subsonic user/password) - ntfy (deny-all default + user.db tokens) - nextcloud (WebDAV/CalDAV/CardDAV app passwords) - vaultwarden (Bitwarden-compatible token auth) - headscale (OIDC + preauth keys for Tailscale nodes) - paperless-ngx (app-layer login + API tokens) Kept "none" with a comment on the rest — they're genuinely public, webhook receivers, native-protocol endpoints, OAuth callbacks, or Anubis-fronted: authentik (×2 + guest outpost), beads-server (dolt), claude-memory (bearer-token MCP), dawarich, ebooks/book-search-api, fire-planner /api, forgejo (git/OCI native clients), frigate (HA integration), immich/frame, insta2spotify /api, instagram-poster (meta fetcher), k8s-portal, matrix (native bearer), monitoring×2 (HA REST scrapes), n8n (webhooks), nvidia, onlyoffice (JWT), owntracks (HTTP Basic), postiz, privatebin (client-side enc), rybbit (analytics tracker), send (E2E file drop), tuya-bridge (API key), vault (own auth + CLI), webhook_handler, woodpecker (forgejo webhooks + OAuth), xray (×3 VPN transports). real-estate-crawler/main.tf:400 already had its comment from a prior edit — not touched here. No live state changes — auth = "app" produces the same middleware chain as auth = "none" (verified earlier this session). This commit is purely documentation + intent-tagging.	2026-05-11 19:25:48 +00:00
Viktor Barzin	bf752dffa5	fix: pvc-autoresizer + TF drift safety — bulk add ignore_changes After fixing the threshold=80% misconfig and seeing two PVCs (prometheus + technitium primary) get stuck Terminating, a 3rd round showed four more PVCs (frigate, hackmd, immich-postgresql, paperless-ngx) in the same state. Same root cause: TF spec'd a smaller storage size than the autoresizer-grown live value, K8s rejected the shrink, TF force-replaced the PVC, and the pvc-protection finalizer held it in Terminating while the pod kept using the underlying volume. Bulk-inject lifecycle.ignore_changes = [spec[0].resources[0].requests] on every kubernetes_persistent_volume_claim block that has resize.topolvm.io/threshold annotations. The pattern was already documented in .claude/CLAUDE.md but ~63 stacks were missing it. Live PVCs are unaffected; this only prevents future TF applies from attempting the destroy+recreate. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-10 21:57:01 +00:00
Viktor Barzin	fecfa211fd	fix: pvc-autoresizer threshold should be 10%, not 80% topolvm/pvc-autoresizer's threshold annotation is the FREE-SPACE percentage below which expansion fires (per upstream README). Setting it to "80%" means "expand when free-space drops below 80%", i.e. as soon as the PVC crosses 20% utilization — which caused prometheus-data-proxmox to be repeatedly expanded from 200Gi to 433Gi in 70 minutes (six 10% bumps, all when the volume was only ~14% used). Once the SC opt-in fix landed (`1e4eac53`) and the inode metrics fix landed (`02a12f1a`), the autoresizer started actively misfiring across 75+ PVCs cluster-wide. Flip the value to "10%" everywhere — that's "expand when free-space drops below 10%", i.e. at 90% utilization, which is the conventional semantic and matches the alert thresholds in prometheus_chart_values.tpl (PVAutoExpanding fires at 80%, PVFillingUp at 95%). The CLAUDE.md PVC template was the source of the misconfig, so update it too. Live PVC annotations were patched in parallel via kubectl annotate; TF apply on each affected stack will be a no-op against those live values. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-10 19:56:16 +00:00
Viktor Barzin	e4f806abe3	ingress_factory: replace `protected` bool with `auth` enum + audit pass across 100 stacks Phase 3+4 of default-deny ingress plan. Replaces the `protected = bool` (default false → unprotected) variable in `modules/kubernetes/ingress_factory` with `auth = string` enum (default "required" → fail-closed). Touches every ingress_factory caller so the audit decision is recorded explicitly in code. ingress_factory (Phase 3): - `auth = "required"`: standard Authentik forward-auth (the legacy `protected = true` semantic). - `auth = "public"`: forward-auth via the new `authentik-forward-auth-public` middleware → dedicated public outpost → guest auto-bind. Logged-in users keep their real identity. - `auth = "none"`: no Authentik middleware. For Anubis-fronted content, native client APIs (Git, /v2/, WebDAV), webhook receivers, the Authentik outpost itself. - `effective_anti_ai` default flips ON only when `auth = "none"` (auth-gated ingresses don't need anti-AI noise; the auth flow already discourages bots). Audit pass (Phase 4) across 96 ingress_factory call sites: - 49 explicit `protected = true` → `auth = "required"` - 8 explicit `protected = false` → `auth = "none"` (5) or `auth = "public"` (3) - 64 previously-default (no protected line) → `auth = "required"` ADDED, then reviewed individually: * 9 Anubis-fronted (blog, www, kms, travel, f1, cyberchef, jsoncrack, homepage, wrongmove UI, privatebin) → `auth = "none"` * 22 native-client / programmatic surfaces (Forgejo Git+/v2/, webhook handler, claude-memory MCP, Nextcloud WebDAV, Matrix, Vault CLI/OIDC, xray VPN, ntfy, woodpecker webhooks, n8n triggers, ntfy push, dawarich location ingestion, immich frame kiosk, headscale CP, send anonymous drops, rybbit beacon, vaultwarden API, Authentik UI itself + outposts) → `auth = "none"` * Remaining ~33 → `auth = "required"` confirmed (admin tools, internal UIs, services without app-level auth) - Smoke-test promotions to `auth = "public"`: fire-planner public UI, k8s-portal API, insta2spotify callback. Three call sites in wrapper modules (`stacks/freedify/factory/`, `stacks/reverse-proxy/modules/reverse_proxy/`) keep their internal `protected` bool — they translate to `auth` internally, out of scope for this rename. Behavior change: previously-default ingresses now fail closed (require Authentik login) unless explicitly flipped to `auth = "none"` or `auth = "public"`. This is the audit goal — no more accidentally-unprotected surfaces. Sites that were intentionally public (Anubis content, native APIs, webhooks) are now explicitly recorded as `auth = "none"`. Drive-by: `modules/create-vm/main.tf` picked up cosmetic alignment via `terraform fmt -recursive` during the audit. Behavior-neutral. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-10 18:53:49 +00:00
Viktor Barzin	8f0502230b	grafana: env-var datasources + reloader so Vault rotations stop breaking dashboards Wealth, Payslips, and Job-Hunter Grafana datasources all baked the rotating PG password into their ConfigMap at TF-apply time, so every 7-day Vault static-role rotation silently broke the panels until a manual `terragrunt apply`. Same family as the recurring grafana-mysql backend bug — Grafana caches creds at startup and never picks up the new ESO-synced password without a restart. Fix: - Each source stack now creates an ExternalSecret in `monitoring` exposing the rotating password as `<NAME>_PG_PASSWORD` env-var. - Grafana mounts those via `envFromSecrets` (optional=true so a missing source stack doesn't block boot) and the datasource ConfigMaps reference `$__env{<NAME>_PG_PASSWORD}` instead of a literal password. - `reloader.stakater.com/auto: "true"` on the Grafana pod restarts it whenever any of the four DB-cred Secrets is updated. Tested end-to-end: forced `vault write -force database/rotate-role/ pg-wealthfolio-sync` → ESO synced (~30s) → reloader fired → Grafana booted with new env in ~50s total → all three /api/datasources /uid/*/health endpoints return "Database Connection OK". Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-09 17:38:38 +00:00
root	f89205a979	Woodpecker CI deploy [CI SKIP]	2026-05-09 13:39:14 +00:00
Viktor Barzin	47bb175a4f	n8n: real-time training loop + decoupled posting instagram-approval: after every tap, immediately fetch /candidates?limit=1 and send the next photo as a fresh inline-keyboard message — the user's tap chains back into this same workflow, so the loop is user-paced. When the pool is exhausted, send an 'all caught up' summary with the backlog count + cumulative training stats. instagram-discover: cron throttled from every-30-min to daily 09:00. The chain handles ongoing training; the daily run only kickstarts a session if the user hasn't been tapping. Limit reduced from 3 → 1 so each kickstart sends a single photo (chain takes over).	2026-05-09 13:37:53 +00:00
Viktor Barzin	77a84ae5e0	ig-poster b17a9737 + n8n discover rewritten to use /candidates with CLIP scoring	2026-05-09 13:31:28 +00:00
Viktor Barzin	a9ee9cee60	ig-poster: 69e395f2 + sync IMMICH_PG_* via ESO for CLIP scoring; postiz publish-notify n8n workflow	2026-05-09 13:16:24 +00:00
Viktor Barzin	352586f711	ig-poster: pivot to Telegram-only delivery (manual IG upload) User dropped Postiz/Instagram OAuth (Meta Business Account flagged + Postiz scope drift). New pipeline ends at Telegram — full-quality JPEG delivered to the bot chat, manually uploaded to IG by the user. - Image bumped to 25e46efd: adds /deliver/{asset_id} endpoint that multipart-uploads to Telegram (URL-fetch fails through Cloudflare for >5MB), then tags 'posted' in Immich. - ESO now syncs telegram_bot_token + telegram_chat_id from Vault. - Public ingress paths grow to ['/image', '/original'] (Authentik bypass on /original is harmless — files are user-tagged, low blast radius — and useful for ad-hoc browser downloads). - Memory limit 512Mi -> 1500Mi: full-resolution Pillow HEIC decode was OOMing on 12MP+ phone photos. - discover.json simplified to scan -> deliver per item; approval and post workflows already deactivated. Telegram bot webhook removed.	2026-05-09 10:45:02 +00:00
Viktor Barzin	7f7698991e	postiz + n8n: real DB URL + webhook-trigger approval - postiz: set DATABASE_URL/REDIS_URL pointing at the bundled subcharts; the chart does NOT auto-wire even when postgresql.enabled=true, so the prisma db:push was failing with empty DATABASE_URL. - n8n approval workflow: swap telegramTrigger -> webhook node so it works without an n8n-stored Telegram credential. Telegram bot's webhook is set via setWebhook to https://n8n.viktorbarzin.me/webhook/instagram-approval. Parse-callback Code node tolerates both shapes ({body:{callback_query:...}} vs {callback_query:...}) so a future move back to telegramTrigger doesn't break.	2026-05-09 00:55:19 +00:00
Viktor Barzin	7a93e09d2f	add postiz + instagram-poster stacks for IG Stories pipeline New stacks: - stacks/postiz/ — Postiz scheduler (Helm chart v1.0.5, image v2.21.7) with bundled PG/Redis, /uploads PVC on proxmox-lvm, JWT_SECRET via ESO from secret/instagram-poster. - stacks/instagram-poster/ — custom Python service that polls Immich for the 'instagram' tag, reformats photos to 9:16 with blurred-bg letterbox, exposes /image/<asset_id> publicly so Postiz can fetch. Image: forgejo.viktorbarzin.me/viktor/instagram-poster. n8n: 3 new workflows (discover, approval, post) for the Telegram inline-button approval UX. Adds ExternalSecret + env vars for TELEGRAM_BOT_TOKEN, TELEGRAM_CHAT_ID, IMMICH_API_KEY, plus static URLs for the new service. Vault: seed secret/instagram-poster with telegram_bot_token, telegram_chat_id, immich_api_key, postiz_api_token, postiz_jwt_secret before applying.	2026-05-09 00:07:44 +00:00
Viktor Barzin	e7ce545da2	[job-hunter] Add infra stack + Grafana dashboard + n8n digest workflow New service stack at stacks/job-hunter/ mirroring the payslip-ingest pattern: per-service CNPG database + role (via dbaas null_resource), Vault static role pg-job-hunter (7d rotation), ExternalSecrets for app secrets and DB creds, Deployment with alembic-migrate init container, ClusterIP Service, Grafana datasource ConfigMap. Grafana dashboard job-hunter.json in Finance folder: new roles per day, source breakdown, top companies, GBP salary distribution, recent roles table (sorted by parse confidence then salary). n8n weekly-digest workflow calls POST /digest/generate with bearer auth every Monday 07:00 London; digest_runs table provides idempotency. Refs: code-snp Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 17:09:29 +00:00
Viktor Barzin	327ce215b9	[infra] Sweep dns_config ignore_changes across all pod-owning resources [ci skip] ## Context Wave 3A (commit `c9d221d5`) added the `# KYVERNO_LIFECYCLE_V1` marker to the 27 pre-existing `ignore_changes = [...dns_config]` sites so they could be grepped and audited. It did NOT address pod-owning resources that were simply missing the suppression entirely. Post-Wave-3A sampling (2026-04-18) found that navidrome, f1-stream, frigate, servarr, monitoring, crowdsec, and many other stacks showed perpetual `dns_config` drift every plan because their `kubernetes_deployment` / `kubernetes_stateful_set` / `kubernetes_cron_job_v1` resources had no `lifecycle {}` block at all. Root cause (same as Wave 3A): Kyverno's admission webhook stamps `dns_config { option { name = "ndots"; value = "2" } }` on every pod's `spec.template.spec.dns_config` to prevent NxDomain search-domain flooding (see `k8s-ndots-search-domain-nxdomain-flood` skill). Without `ignore_changes` on every Terraform-managed pod-owner, Terraform repeatedly tries to strip the injected field. ## This change Extends the Wave 3A convention by sweeping EVERY `kubernetes_deployment`, `kubernetes_stateful_set`, `kubernetes_daemon_set`, `kubernetes_cron_job_v1`, `kubernetes_job_v1` (+ their `_v1` variants) in the repo and ensuring each carries the right `ignore_changes` path: - kubernetes_deployment / stateful_set / daemon_set / job_v1: `spec[0].template[0].spec[0].dns_config` - kubernetes_cron_job_v1: `spec[0].job_template[0].spec[0].template[0].spec[0].dns_config` (extra `job_template[0]` nesting — the CronJob's PodTemplateSpec is one level deeper) Each injection / extension is tagged `# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2` inline so the suppression is discoverable via `rg 'KYVERNO_LIFECYCLE_V1' stacks/`. Two insertion paths are handled by a Python pass (`/tmp/add_dns_config_ignore.py`): 1. No existing `lifecycle {}`: inject a brand-new block just before the resource's closing `}`. 108 new blocks on 93 files. 2. Existing `lifecycle {}` (usually for `DRIFT_WORKAROUND: CI owns image tag` from Wave 4, commit a62b43d1): extend its `ignore_changes` list with the dns_config path. Handles both inline (`= [x]`) and multiline (`= [\n x,\n]`) forms; ensures the last pre-existing list item carries a trailing comma so the extended list is valid HCL. 34 extensions. The script skips anything already mentioning `dns_config` inside an `ignore_changes`, so re-running is a no-op. ## Scale - 142 total lifecycle injections/extensions - 93 `.tf` files touched - 108 brand-new `lifecycle {}` blocks + 34 extensions of existing ones - Every Tier 0 and Tier 1 stack with a pod-owning resource is covered - Together with Wave 3A's 27 pre-existing markers → 169 greppable `KYVERNO_LIFECYCLE_V1` dns_config sites across the repo ## What is NOT in this change - `stacks/trading-bot/main.tf` — entirely commented-out block (`/* … /`). Python script touched the file, reverted manually. - `_template/main.tf.example` skeleton — kept minimal on purpose; any future stack created from it should either inherit the Wave 3A one-line form or add its own on first `kubernetes_deployment`. - `terraform fmt` fixes to pre-existing alignment issues in meshcentral, nvidia/modules/nvidia, vault — unrelated to this commit. Left for a separate fmt-only pass. - Non-pod resources (`kubernetes_service`, `kubernetes_secret`, `kubernetes_manifest`, etc.) — they don't own pods so they don't get Kyverno dns_config mutation. ## Verification Random sample post-commit: ``` $ cd stacks/navidrome && ../../scripts/tg plan → No changes. $ cd stacks/f1-stream && ../../scripts/tg plan → No changes. $ cd stacks/frigate && ../../scripts/tg plan → No changes. $ rg -c 'KYVERNO_LIFECYCLE_V1' stacks/ --include='.tf' --include='*.tf.example' \ \| awk -F: '{s+=$2} END {print s}' 169 ``` ## Reproduce locally 1. `git pull` 2. `rg 'KYVERNO_LIFECYCLE_V1' stacks/ \| wc -l` → 169+ 3. `cd stacks/navidrome && ../../scripts/tg plan` → expect 0 drift on the deployment's dns_config field. Refs: code-seq (Wave 3B dns_config class closed; kubernetes_manifest annotation class handled separately in `8d94688d` for tls_secret) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 21:19:48 +00:00
Viktor Barzin	8b43692af0	[infra] Suppress Goldilocks vpa-update-mode label drift on all namespaces [ci skip] ## Context Wave 3B-continued: the Goldilocks VPA dashboard (stacks/vpa) runs a Kyverno ClusterPolicy `goldilocks-vpa-auto-mode` that mutates every namespace with `metadata.labels["goldilocks.fairwinds.com/vpa-update-mode"] = "off"`. This is intentional — Terraform owns container resource limits, and Goldilocks should only provide recommendations, never auto-update. The label is how Goldilocks decides per-namespace whether to run its VPA in `off` mode. Effect on Terraform: every `kubernetes_namespace` resource shows the label as pending-removal (`-> null`) on every `scripts/tg plan`. Dawarich survey 2026-04-18 confirmed the drift. Cluster-side count: 88 namespaces carry the label (`kubectl get ns -o json \| jq ... \| wc -l`). Every TF-managed namespace is affected. This commit brings the intentional admission drift under the same `# KYVERNO_LIFECYCLE_V1` discoverability marker introduced in `c9d221d5` for the ndots dns_config pattern. The marker now stands generically for any Kyverno admission-webhook drift suppression; the inline comment records which specific policy stamps which specific field so future grep audits show why each suppression exists. ## This change 107 `.tf` files touched — every stack's `resource "kubernetes_namespace"` resource gets: ```hcl lifecycle { # KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]] } ``` Injection was done with a brace-depth-tracking Python pass (`/tmp/add_goldilocks_ignore.py`): match `^resource "kubernetes_namespace" ` → track `{` / `}` until the outermost closing brace → insert the lifecycle block before the closing brace. The script is idempotent (skips any file that already mentions `goldilocks.fairwinds.com/vpa-update-mode`) so re-running is safe. Vault stack picked up 2 namespaces in the same file (k8s-users produces one, plus a second explicit ns) — confirmed via file diff (+8 lines). ## What is NOT in this change - `stacks/trading-bot/main.tf` — entire file is `/* … /` commented out (paused 2026-04-06 per user decision). Reverted after the script ran. - `stacks/_template/main.tf.example` — per-stack skeleton, intentionally minimal. User keeps it that way. Not touched by the script (file has no real `resource "kubernetes_namespace"` — only a placeholder comment). - `.terraform/` copies (e.g. `stacks/metallb/.terraform/modules/...`) — gitignored, won't commit; the live path was edited. - `terraform fmt` cleanup of adjacent pre-existing alignment issues in authentik, freedify, hermes-agent, nvidia, vault, meshcentral. Reverted to keep the commit scoped to the Goldilocks sweep. Those files will need a separate fmt-only commit or will be cleaned up on next real apply to that stack. ## Verification Dawarich (one of the hundred-plus touched stacks) showed the pattern before and after: ``` $ cd stacks/dawarich && ../../scripts/tg plan Before: Plan: 0 to add, 2 to change, 0 to destroy. # kubernetes_namespace.dawarich will be updated in-place (goldilocks.fairwinds.com/vpa-update-mode -> null) # module.tls_secret.kubernetes_secret.tls_secret will be updated in-place (Kyverno generate. labels — fixed in `8d94688d`) After: No changes. Your infrastructure matches the configuration. ``` Injection count check: ``` $ rg -c 'KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode' stacks/ \| awk -F: '{s+=$2} END {print s}' 108 ``` ## Reproduce locally 1. `git pull` 2. Pick any stack: `cd stacks/<name> && ../../scripts/tg plan` 3. Expect: no drift on the namespace's goldilocks.fairwinds.com/vpa-update-mode label. Closes: code-dwx Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 21:15:27 +00:00
Viktor Barzin	99180bec42	[n8n] Fix broken DIUN auto-upgrade pipeline — missing auth token to claude-agent-service ## Context DIUN has been detecting image updates and firing Slack + webhook notifications for weeks, but zero automated upgrades ran because the handoff from n8n to claude-agent-service was silently 401-ing. The pipeline (DIUN → n8n webhook → claude-agent-service /execute → service-upgrade agent) was migrated from DevVM SSH to K8s HTTP in `42f1c3cf`. The migration wired `claude-agent-service` (API_BEARER_TOKEN env set), updated the n8n workflow JSON to POST with `Authorization: Bearer $env.CLAUDE_AGENT_API_TOKEN`, but missed two things on the n8n side: 1. The deployment didn't expose `CLAUDE_AGENT_API_TOKEN` to the n8n container — workflow sent `Authorization: Bearer ` (empty). 2. The workflow header expression used JS concat (`='Bearer ' + $env.X`) which n8n 1.x does NOT evaluate in HTTP Request node header params. It needs template-literal form: `=Bearer {{ $env.X }}`. Evidence: `claude-agent-service` logs showed only `/health` probes — zero `/execute` calls over 12h despite DIUN firing webhooks. n8n PG execution 2250 returned `401 Missing bearer token`. ## This change - Adds ExternalSecret `claude-agent-token` in the `n8n` namespace that pulls `api_bearer_token` from Vault `secret/claude-agent-service` (same source as the receiving service's token). - Wires the token into the n8n container as env var `CLAUDE_AGENT_API_TOKEN` via `secret_key_ref`. - Sets `N8N_BLOCK_ENV_ACCESS_IN_NODE=false` so expressions CAN read `$env.*` at all (default in 1.x is false already, but setting explicitly guards against upstream default flips). - Fixes the workflow JSON backup (`workflows/diun-upgrade.json`) header expression to use `{{ $env.X }}` template syntax. The live workflow in n8n's PG DB was also patched in place (one-time `UPDATE workflow_entity SET nodes = REPLACE(...)` — workflows are not TF-managed; they were imported once). ## What is NOT in this change - No retroactive re-run of skipped DIUN events. They'll be rediscovered in future scans. - No change to the `claude-agent-service` side — its token and endpoint were already correct. - No Slack alert on n8n HTTP-node failures — future work; right now a broken workflow fails silently unless you check Execution History. ## End-to-end verification ``` $ curl -X POST n8n.viktorbarzin.me/webhook/30805ab6-... \ -d '{"diun_entry_status":"update","diun_entry_image":"docker.io/library/httpd","diun_entry_imagetag":"2.4.66",...}' {"message":"Workflow was started"} HTTP 200 # n8n PG: execution_entity latest row → status=success # claude-agent-service logs → "POST /execute HTTP/1.1" 202 Accepted ``` ## Reproduce locally ``` 1. vault login -method=oidc 2. cd stacks/n8n && ../../scripts/tg apply 3. kubectl -n n8n exec deploy/n8n -- printenv CLAUDE_AGENT_API_TOKEN (should print 64-char hex) 4. Fire synthetic webhook with non-critical image (httpd / alpine) 5. Check n8n execution is success, claude-agent-service shows 202 ``` Closes: code-ekz Related: code-bck Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 10:41:09 +00:00
Viktor Barzin	42f1c3cf4f	[claude-agent-service] Migrate all pipelines from DevVM SSH to K8s HTTP ## Context The claude-agent-service K8s pod (deployed 2026-04-15) provides an HTTP API for running Claude headless agents. Three workflows still SSH'd to the DevVM (10.0.10.10) to invoke `claude -p`. This eliminates that dependency. ## This change Pipeline migrations (SSH → HTTP POST to claude-agent-service): - `.woodpecker/issue-automation.yml` — Vault auth fetches API token instead of SSH key; curl POST /execute + poll /jobs/{id} replaces SSH invocation - `scripts/postmortem-pipeline.sh` — same pattern; uses jq for safe JSON construction of TODO payloads - `.woodpecker/postmortem-todos.yml` — drop openssh-client from apk install - `stacks/n8n/workflows/diun-upgrade.json` — SSH node replaced with HTTP Request node; API token via $env.CLAUDE_AGENT_API_TOKEN (added to Vault secret/n8n) Documentation updates: - `docs/architecture/incident-response.md` — Mermaid diagram: DevVM → K8s - `docs/architecture/automated-upgrades.md` — pipeline diagram + n8n action - `AGENTS.md` — pipeline description updated ## What is NOT in this change - DevVM decommissioning (still hosts terminal/foolery services) - Removal of SSH key secrets from Vault (kept for rollback) - n8n workflow import (must be done manually in n8n UI) [ci skip] Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>	2026-04-18 10:12:02 +00:00
Viktor Barzin	e80b2f026f	[infra] Migrate Terraform state from local SOPS to PostgreSQL backend Two-tier state architecture: - Tier 0 (infra, platform, cnpg, vault, dbaas, external-secrets): local state with SOPS encryption in git — unchanged, required for bootstrap. - Tier 1 (105 app stacks): PostgreSQL backend on CNPG cluster at 10.0.20.200:5432/terraform_state with native pg_advisory_lock. Motivation: multi-operator friction (every workstation needed SOPS + age + git-crypt), bootstrap complexity for new operators, and headless agents/CI needing the full encryption toolchain just to read state. Changes: - terragrunt.hcl: conditional backend (local vs pg) based on tier0 list - scripts/tg: tier detection, auto-fetch PG creds from Vault for Tier 1, skip SOPS and Vault KV locking for Tier 1 stacks - scripts/state-sync: tier-aware encrypt/decrypt (skips Tier 1) - scripts/migrate-state-to-pg: one-shot migration script (idempotent) - stacks/vault/main.tf: pg-terraform-state static role + K8s auth role for claude-agent namespace - stacks/dbaas: terraform_state DB creation + MetalLB LoadBalancer service on shared IP 10.0.20.200 - Deleted 107 .tfstate.enc files for migrated Tier 1 stacks - Cleaned up per-stack tiers.tf (now generated by root terragrunt.hcl) [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 19:33:12 +00:00
Viktor Barzin	216d4240c9	[infra] Add Cloudflare provider to all stack lock files and generated providers Terragrunt now generates cloudflare_provider.tf (Vault-sourced API key) and includes cloudflare in required_providers. These are the generated files from running `terragrunt init -upgrade` across all stacks. [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 16:31:36 +00:00
Viktor Barzin	b1d152be1f	[infra] Auto-create Cloudflare DNS records from ingress_factory ## Context Deploying new services required manually adding hostnames to cloudflare_proxied_names/cloudflare_non_proxied_names in config.tfvars — a separate file from the service stack. This was frequently forgotten, leaving services unreachable externally. ## This change: - Add `dns_type` parameter to `ingress_factory` and `reverse_proxy/factory` modules. Setting `dns_type = "proxied"` or `"non-proxied"` auto-creates the Cloudflare DNS record (CNAME to tunnel or A/AAAA to public IP). - Simplify cloudflared tunnel from 100 per-hostname rules to wildcard `*.viktorbarzin.me → Traefik`. Traefik still handles host-based routing. - Add global Cloudflare provider via terragrunt.hcl (separate cloudflare_provider.tf with Vault-sourced API key). - Migrate 118 hostnames from centralized config.tfvars to per-service dns_type. 17 hostnames remain centrally managed (Helm ingresses, special cases). - Update docs, AGENTS.md, CLAUDE.md, dns.md runbook. ``` BEFORE AFTER config.tfvars (manual list) stacks/<svc>/main.tf \| module "ingress" { v dns_type = "proxied" stacks/cloudflared/ } for_each = list \| cloudflare_record auto-creates tunnel per-hostname cloudflare_record + annotation ``` ## What is NOT in this change: - Uptime Kuma monitor migration (still reads from config.tfvars) - 17 remaining centrally-managed hostnames (Helm, special cases) - Removal of allow_overwrite (keep until migration confirmed stable) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 13:45:04 +00:00
Viktor Barzin	c33f597111	feat(upgrade-agent): add automated service upgrade pipeline with n8n + DIUN Pipeline: DIUN detects new image versions every 6h → webhook to n8n → n8n filters (skip databases/custom/infra/:latest) and rate-limits (max 5/6h) → SSH to dev VM → claude -p runs upgrade agent. Agent workflow: resolve GitHub repo → fetch changelogs → classify risk (SAFE/CAUTION) → backup DB if needed → bump version in .tf → commit+push → wait for CI → verify (pod ready + HTTP + Uptime Kuma) → rollback on failure. Changes: - stacks/n8n: add N8N_PORT=5678 to fix K8s env var conflict - stacks/n8n/workflows: version-controlled n8n workflow backup - docs/architecture/automated-upgrades.md: full pipeline documentation - AGENTS.md: add upgrade agent section - service-catalog.md: update DIUN description Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 21:38:27 +00:00
Viktor Barzin	8b004c4c94	feat(storage): migrate all sensitive services to proxmox-lvm-encrypted Reconcile Terraform with cluster state after manual encrypted PVC migrations and complete the remaining unfinished migrations. All services storing sensitive data now use LUKS2-encrypted block storage via the Proxmox CSI plugin. ## Context Only Technitium DNS was using encrypted storage in Terraform. Many services had been manually migrated to encrypted PVCs in the cluster, but Terraform was never updated — creating dangerous state drift where a `tg apply` could recreate unencrypted PVCs. ## This change Phase 0 — Infrastructure: - Add `proxmox-lvm-encrypted` StorageClass to Helm values (extraParameters) - Add ExternalSecret for LUKS encryption passphrase to Terraform - Fix CSI node plugin memory: `node.plugin.resources` (not `node.resources`) with 1280Mi limit for LUKS2 Argon2id key derivation Phase 1 — TF state reconciliation (zero downtime): - Health, Matrix, N8N, Forgejo, Vaultwarden, Mailserver: state rm + import - Redis, DBAAS MySQL, DBAAS PostgreSQL: Helm/CNPG value updates Phase 2 — Data migration (encrypted PVCs existed but unused): - Headscale, Frigate, MeshCentral: rsync + switchover - Nextcloud (20Gi): rsync + chart_values update Phase 3 — New encrypted PVCs: - Roundcube HTML, HackMD, Affine, DBAAS pgadmin: create + rsync + switchover Phase 4 — Cleanup: - Deleted 5 orphaned unencrypted PVCs ## Services migrated (18 PVCs across 14 namespaces) ``` vaultwarden → vaultwarden-data-encrypted dbaas → datadir-mysql-cluster-0, pg-cluster-{1,2}, dbaas-pgadmin-encrypted mailserver → mailserver-data-encrypted, roundcubemail-{enigma,html}-encrypted nextcloud → nextcloud-data-encrypted forgejo → forgejo-data-encrypted matrix → matrix-data-encrypted n8n → n8n-data-encrypted affine → affine-data-encrypted health → health-uploads-encrypted hackmd → hackmd-data-encrypted redis → redis-data-redis-node-{0,1} headscale → headscale-data-encrypted frigate → frigate-config-encrypted meshcentral → meshcentral-{data,files}-encrypted ``` Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 20:15:30 +00:00
Viktor Barzin	bcad200a23	chore: add untracked stacks, scripts, and agent configs - New stacks: beads-server, hermes-agent - Terragrunt tiers.tf for infra, phpipam, status-page - Secrets symlinks for vault, phpipam, hermes-agent - Scripts: cluster_manager, image_pull, containerd pullthrough setup - Frigate config, audiblez-web app source, n8n workflows dir - Claude agent: service-upgrade, reference: upgrade-config.json - Removed: claudeception skill, excalidraw empty submodule, temp listings [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 09:33:06 +00:00
Viktor Barzin	82b0f6c4cb	truenas deprecation: migrate all non-immich storage to proxmox NFS - Migrate 7 backup CronJobs to Proxmox host NFS (192.168.1.127) (etcd, mysql, postgresql, nextcloud, redis, vaultwarden, plotting-book) - Migrate headscale backup, ebook2audiobook, osm_routing to Proxmox NFS - Migrate servarr (lidarr, readarr, soulseek) NFS refs to Proxmox - Remove 79 orphaned TrueNAS NFS module declarations from 49 stacks - Delete stacks/platform/modules/ (27 dead module copies, 65MB) - Update nfs-truenas StorageClass to point to Proxmox (192.168.1.127) - Remove iscsi DNS record from config.tfvars - Fix woodpecker persistence config and alertmanager PV Only Immich (8 PVCs, ~1.4TB) remains on TrueNAS.	2026-04-12 14:35:39 +01:00
Viktor Barzin	09b4bad958	feat: pin ~28 images to specific versions, enable DIUN monitoring, add app-stacks pipeline Pin third-party images from :latest to current stable versions: - Platform: cloudflared, technitium, snmp-exporter, pve-exporter, headscale, shadowsocks, xray - Apps: paperless-ngx, linkwarden, wealthfolio, speedtest, synapse, n8n, prowlarr, qbittorrent, lidarr, rybbit, ollama, immichframe, cyberchef, networking-toolbox, echo, coturn, shlink, affine Enable DIUN annotations on all pinned deployments with per-image tag patterns. Add Woodpecker app-stacks pipeline for selective terragrunt apply on changed app stacks.	2026-04-06 14:27:13 +03:00
Viktor Barzin	cb8a808700	feat(storage): migrate 38 NFS PVCs to proxmox-lvm (Wave 2) Add proxmox-lvm PVCs with pvc-autoresizer annotations for all remaining single-pod app data services. Deployments updated to use new block storage PVCs. Old NFS modules retained for rollback. Services: affine, changedetection, diun, excalidraw, f1-stream, hackmd, isponsorblocktv, matrix, n8n, send, grampsweb, health, onlyoffice, owntracks, paperless-ngx, privatebin, resume, speedtest, stirling-pdf, tandoor, rybbit (clickhouse), tor-proxy (torrserver), whisper+piper, frigate (config), ollama (ui), servarr (prowlarr/listenarr/qbittorrent), aiostreams, freshrss (extensions), meshcentral (data+files), openclaw (data+home+ openlobster), technitium, mailserver (data+roundcube html+enigma), dbaas (pgadmin). Strategy set to Recreate where needed for RWO volumes.	2026-04-04 19:25:12 +03:00
Viktor Barzin	21cfa8c072	bump memory limits for OOM-prone services FreshRSS: 64Mi → 256Mi (171 restarts, VPA upper ~204Mi) Actual Budget HTTP API: 128Mi → 384Mi (17 restarts, VPA upper ~297Mi) n8n: 768Mi → 1Gi (18 restarts, VPA upper ~765Mi) Dawarich: 768Mi → 896Mi (2 restarts, VPA upper ~628Mi) Traefik: 384Mi → 768Mi (2 restarts, VPA upper ~584Mi)	2026-03-21 11:12:12 +00:00
Viktor Barzin	06a0d0599a	regenerate providers.tf: remove vault_root_token variable [ci skip]	2026-03-15 21:21:01 +00:00
Viktor Barzin	0f262ceda3	add pod dependency management via Kyverno init container injection Kyverno ClusterPolicy reads dependency.kyverno.io/wait-for annotation and injects busybox init containers that block until each dependency is reachable (nc -z). Annotations added to 18 stacks (24 deployments). Includes graceful-db-maintenance.sh script for planned DB maintenance (scales dependents to 0, saves replica counts, restores on startup).	2026-03-15 19:17:57 +00:00
Viktor Barzin	1acf8cc4e8	migrate consuming stacks to ESO + remove k8s-dashboard static token Phase 9: ExternalSecret migration across 26 stacks: Fully migrated (vault data source removed, ESO delivers secrets): - speedtest, shadowsocks, wealthfolio, plotting-book, f1-stream, tandoor - n8n, dawarich, diun, netbox, onlyoffice, tuya-bridge - hackmd (ESO template for DB URL), health (ESO template for DB URL) - trading-bot (ESO template for DATABASE_URL + 7 secret env vars) - forgejo (removed unused vault data source) Partially migrated (vault kept for plan-time, ESO added for runtime): - immich, linkwarden, nextcloud, paperless-ngx (jsondecode for homepage) - claude-memory, rybbit, url, webhook_handler (plan-time in locals/jobs) - woodpecker, openclaw, resume (plan-time in helm values/jobs/modules) 17 stacks unchanged (all plan-time: homepage annotations, configmaps, module inputs) — vault data source works with OIDC auth. Phase 17a: Remove k8s-dashboard static admin token secret. Users now get tokens via: vault write kubernetes/creds/dashboard-admin	2026-03-15 19:05:04 +00:00
Viktor Barzin	683f255c2e	fix: increase memory for OOMKilled services - hackmd: 64Mi -> 256Mi (Node.js app OOMKilled after 14min) - n8n: limit 512Mi -> 768Mi (DB timeouts at 88% mem usage) - speedtest: 128Mi -> 256Mi (OOMKilled during startup) - shlink: limit 512Mi -> 768Mi (OOMKilled after startup)	2026-03-15 10:26:11 +00:00
Viktor Barzin	6f562b5da6	add vaultwarden daily backup CronJob to NFS SQLite backup via Online Backup API + copy of RSA keys, attachments, sends, and config. 30-day retention with rotation. Pod affinity ensures co-scheduling with vaultwarden for RWO PVC access.	2026-03-15 00:03:59 +00:00
Viktor Barzin	a8d944eb9b	migrate all secrets from SOPS to Vault KV - Add vault provider to root terragrunt.hcl (generated providers.tf) - Delete stacks/vault/vault_provider.tf (now in generated providers.tf) - Add 124 variable declarations + 43 vault_kv_secret_v2 resources to vault/main.tf to populate Vault KV at secret/<stack-name> - Migrate 43 consuming stacks to read secrets from Vault KV via data "vault_kv_secret_v2" instead of SOPS var-file - Add dependency "vault" to all migrated stacks' terragrunt.hcl - Complex types (maps/lists) stored as JSON strings, decoded with jsondecode() in locals blocks Bootstrap secrets (vault_root_token, vault_authentik_client_id, vault_authentik_client_secret) remain in SOPS permanently. Apply order: vault stack first (populates KV), then all others.	2026-03-14 17:15:48 +00:00
Viktor Barzin	2be858f616	fix: eliminate memory overcommit to prevent node OOM crashes Set requests = limits (Guaranteed QoS) across LimitRange defaults and explicit pod resources. Node2 crashed 2026-03-14 from 250% memory overcommit (61GB limits on 24GB node). Changes: - LimitRange: default = defaultRequest for all 6 tiers - Grafana: 3 → 2 replicas - Grampsweb: document why replicas=0 - Prometheus: 1Gi/4Gi → 3Gi/3Gi - OpenClaw: 512Mi/2Gi → 768Mi/768Mi - Immich server: 256Mi/2Gi → 512Mi/512Mi - Immich postgresql: 256Mi/1Gi → 512Mi/512Mi - Calibre: 256Mi/1536Mi → 256Mi/256Mi - Linkwarden: 256Mi/1536Mi → 768Mi/768Mi - N8N: 256Mi/1Gi → 512Mi/512Mi - MySQL cluster: 1Gi/3-4Gi → 2Gi/2Gi - pg-cluster (CNPG): 512Mi/4Gi → 512Mi/512Mi - DBaaS ResourceQuota limits.memory: 64Gi → 12Gi [ci skip]	2026-03-14 16:01:41 +00:00
Viktor Barzin	b00f810d3d	Remove all CPU limits cluster-wide to eliminate CFS throttling CPU limits cause CFS throttling even when nodes have idle capacity. Move to a request-only CPU model: keep CPU requests for scheduling fairness but remove all CPU limits. Memory limits stay (incompressible). Changes across 108 files: - Kyverno LimitRange policy: remove cpu from default/max in all 6 tiers - Kyverno ResourceQuota policy: remove limits.cpu from all 5 tiers - Custom ResourceQuotas: remove limits.cpu from 8 namespace quotas - Custom LimitRanges: remove cpu from default/max (nextcloud, onlyoffice) - RBAC module: remove cpu_limits variable and quota reference - Freedify factory: remove cpu_limit variable and limits reference - 86 deployment files: remove cpu from all limits blocks - 6 Helm values files: remove cpu under limits sections	2026-03-14 08:51:45 +00:00
Viktor Barzin	76a4987eef	[ci skip] add Forgejo task pipeline for OpenClaw AI agent Forgejo issues as a task queue for OpenClaw: - Forgejo OAuth2 with Authentik SSO, self-registration disabled - Webhook-triggered task processing (instant) + CronJob backup (5min poll) - Tasks processed via Mistral Large 3 (NVIDIA NIM API) - Results posted as issue comments, auto-labeled and closed - Comment follow-ups and reopened issues supported - n8n RBAC for OpenClaw pod exec (future workflow integration)	2026-03-07 21:11:07 +00:00
Viktor Barzin	6bd3970579	[ci skip] add Homepage gethomepage.dev annotations to all services Add Kubernetes ingress annotations for Homepage auto-discovery across ~88 services organized into 11 groups. Enable serviceAccount for RBAC, configure group layouts, and add Grafana/Frigate/Speedtest widgets.	2026-03-07 20:39:54 +00:00
Viktor Barzin	1f2c1ca361	[ci skip] phase 5+6: update CI pipelines for SOPS, add sensitive=true to secret vars Phase 5 — CI pipelines: - default.yml: add SOPS decrypt in prepare step, change git add . to specific paths (stacks/ state/ .woodpecker/), cleanup on success+failure - renew-tls.yml: change git add . to git add secrets/ state/ Phase 6 — sensitive=true: - Add sensitive = true to 256 variable declarations across 149 stack files - Prevents secret values from appearing in terraform plan output - Does NOT modify shared modules (ingress_factory, nfs_volume) to avoid breaking module interface contracts Note: CI pipeline SOPS decryption requires sops_age_key Woodpecker secret to be created before the pipeline will work with SOPS. Until then, the old terraform.tfvars path continues to function.	2026-03-07 14:30:36 +00:00
Viktor Barzin	197cef7f3f	[ci skip] add auto-generated tiers.tf, planning docs, and helm chart cache - tiers.tf: Terragrunt-generated tier locals for all standalone stacks - .planning/: resource audit research and plans - docs/plans/: cluster hardening design doc - redis-25.3.2.tgz: Bitnami Redis Helm chart cache	2026-03-06 23:55:57 +00:00
Viktor Barzin	79a2aa3784	[ci skip] migrate 29 services from inline NFS to CSI-backed PV/PVC Batch migration of all single-volume and simple multi-volume stacks. All services verified healthy after migration. Uses nfs-truenas StorageClass with soft,timeo=30,retrans=3 mount options to eliminate stale NFS mount hangs. Services: atuin, audiobookshelf, calibre, changedetection, diun, excalidraw, forgejo, freshrss, grampsweb, hackmd, health, isponsorblocktv, matrix, meshcentral, n8n, navidrome, ntfy, ollama, onlyoffice, owntracks, paperless-ngx, poison-fountain, send, stirling-pdf, tandoor, wealthfolio, whisper, woodpecker, ytdlp	2026-03-02 00:15:39 +00:00
Viktor Barzin	9e4fb23b10	[ci skip] right-size all pod resources based on VPA + live metrics audit Full cluster resource audit: cross-referenced Goldilocks VPA recommendations, live kubectl top metrics, and Terraform definitions for 100+ containers. Critical fixes: - dashy: CPU throttled at 98% (490m/500m) → 2 CPU limit - stirling-pdf: CPU throttled at 99.7% (299m/300m) → 2 CPU limit - traefik auth-proxy/bot-block-proxy: mem limit 32Mi → 128Mi Added explicit resources to ~40 containers that had none: - audiobookshelf, changedetection, cyberchef, dawarich, diun, echo, excalidraw, freshrss, hackmd, isponsorblocktv, linkwarden, n8n, navidrome, ntfy, owntracks, privatebin, send, shadowsocks, tandoor, tor-proxy, wealthfolio, networking-toolbox, rybbit, mailserver, cloudflared, pgadmin, phpmyadmin, crowdsec-web, xray, wireguard, k8s-portal, tuya-bridge, ollama-ui, whisper, piper, immich-server, immich-postgresql, osrm-foot GPU containers: added CPU/mem alongside GPU limits: - ollama: removed CPU/mem limits (models vary in size), keep GPU only - frigate: req 500m/2Gi, lim 4/8Gi + GPU - immich-ml: req 100m/1Gi, lim 2/4Gi + GPU Right-sized ~25 over-provisioned containers: - kms-web-page: 500m/512Mi → 50m/64Mi (was using 0m/10Mi) - onlyoffice: CPU 8 → 2 (VPA upper 45m) - realestate-crawler-api: CPU 2000m → 250m - blog/travel-blog/webhook-handler: 500m → 100m - coturn/health/plotting-book: reduced to match actual usage Conservative methodology: limits = max(VPA upper * 2, live usage * 2)	2026-03-01 19:18:50 +00:00
Viktor Barzin	89a6e08245	[ci skip] Infrastructure hardening: security, monitoring, reliability, maintainability Phase 1 - Critical Security: - Netbox: move hardcoded DB/superuser passwords to variables - MeshCentral: disable public registration, add Authentik auth - Traefik: disable insecure API dashboard (api.insecure=false) - Traefik: configure forwarded headers with Cloudflare trusted IPs Phase 2 - Security Hardening: - Add security headers middleware (HSTS, X-Frame-Options, nosniff, etc.) - Add Kyverno pod security policies in audit mode (privileged, host namespaces, SYS_ADMIN, trusted registries) - Tighten rate limiting (avg=10, burst=50) - Add Authentik protection to grampsweb Phase 3 - Monitoring & Alerting: - Add critical service alerts (PostgreSQL, MySQL, Redis, Headscale, Authentik, Loki) - Increase Loki retention from 7 to 30 days (720h) - Add predictive PV filling alert (predict_linear) - Re-enable Hackmd and Privatebin down alerts Phase 4 - Reliability: - Add resource requests/limits to Redis, DBaaS, Technitium, Headscale, Vaultwarden, Uptime Kuma - Increase Alloy DaemonSet memory to 512Mi/1Gi Phase 6 - Maintainability: - Extract duplicated tiers locals to terragrunt.hcl generate block (removed from 67 stacks) - Replace hardcoded NFS IP 10.0.10.15 with var.nfs_server (114 instances across 63 files) - Replace hardcoded Redis/PostgreSQL/MySQL/Ollama/mail host references with variables across ~35 stacks - Migrate xray raw ingress resources to ingress_factory modules	2026-02-23 22:05:28 +00:00
Viktor Barzin	c7c7047f1c	[ci skip] Flatten module wrappers into stack roots Remove the module "xxx" { source = "./module" } indirection layer from all 66 service stacks. Resources are now defined directly in each stack's main.tf instead of through a wrapper module. - Merge module/main.tf contents into stack main.tf - Apply variable replacements (var.tier -> local.tiers.X, renamed vars) - Fix shared module paths (one fewer ../ at each level) - Move extra files/dirs (factory/, chart_values, subdirs) to stack root - Update state files to strip module.<name>. prefix - Update CLAUDE.md to reflect flat structure Verified: terragrunt plan shows 0 add, 0 destroy across all stacks.	2026-02-22 15:13:55 +00:00

1 2

52 commits