infra

Author	SHA1	Message	Date
Viktor Barzin	f0ce7b0363	fire-planner: add stack, Vault DB role, dashboard, DB New stacks/fire-planner/ mirrors payslip-ingest layout: - ExternalSecret pulling RECOMPUTE_BEARER_TOKEN from Vault secret/fire-planner - DB ExternalSecret templating DB_CONNECTION_STRING via static role pg-fire-planner - FastAPI Deployment (serve), CronJob (recompute-all monthly on 2nd at 09:00 UTC, scheduled after wealthfolio-sync's 1st at 08:00), ClusterIP Service - Grafana datasource ConfigMap "FirePlanner" — `database` inside jsonData (`cc56ba29` fix; otherwise Grafana 11.2+ hits "you do not have default database") Plus: - vault/main.tf: pg-fire-planner static role (7d rotation), allowed_roles - dbaas/modules/dbaas/main.tf: null_resource creates fire_planner DB+role - monitoring/dashboards/fire-planner.json: 9-panel Finance-folder dashboard (NW timeseries, MC fan chart, success heatmap, lifetime tax bars, years-to-ruin table, optimal leave-UK stat, ending wealth stat, UK success-by-strategy bars, sequence-risk correlation table) - monitoring/modules/monitoring/grafana.tf: register "fire-planner.json" in Finance folder Apply order: 1. vault stack — creates the static role 2. dbaas stack — creates the database & role 3. external-secrets stack picks up vault-database refs (no change needed) 4. fire-planner stack — first apply with -target=kubernetes_manifest.db_external_secret before full apply, per the plan-time-data-source pattern 5. monitoring stack — picks up the new dashboard ConfigMap [ci skip] Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-25 17:27:19 +00:00
Viktor Barzin	484b4c7190	vault: complete Phase 2 NFS-hostile migration; remove nfs-proxmox SC All 3 vault voters now on proxmox-lvm-encrypted (vault-0 16:18, vault-1 + vault-2 today). The NFS fsync incompatibility identified in the 2026-04-22 raft-leader-deadlock post-mortem is no longer reachable — raft consensus log + audit log live on LUKS2 block storage with real fsync semantics. Cluster-wide consumers of the inline kubernetes_storage_class.nfs_proxmox dropped to zero after the rolling, so the resource is removed from infra/stacks/vault/main.tf. Released NFS PVs (6) remain in the cluster and will be reclaimed in Phase 3 cleanup. Lesson learned (recorded in plan): pvc-protection finalizer races the StatefulSet controller — pod recreates on the OLD PVCs unless the finalizer is patched out before pod delete. Force-finalize technique applied to vault-1 + vault-2 successfully. Closes: code-gy7h	2026-04-25 17:10:00 +00:00
Viktor Barzin	bf4c7618d8	wealth: SQLite→PG ETL sidecar + new Grafana dashboard Mirrors Wealthfolio's daily_account_valuation / accounts / activities from SQLite into a new PG database (wealthfolio_sync) every hour, so Grafana can chart net worth, contributions, and growth over time. Components: - dbaas: null_resource creates wealthfolio_sync DB + role on the CNPG cluster (dynamic primary lookup so it survives failover). - vault: pg-wealthfolio-sync static role rotates the password every 7d. - wealthfolio: ExternalSecret pulls the rotated password into the WF namespace; new pg-sync sidecar (alpine + sqlite + postgresql-client + busybox crond) does sqlite3 .backup → TSV dump → truncate-and-reload psql, hourly at :07. Plus a grafana-wealth-datasource ConfigMap in the monitoring namespace (uid: wealth-pg). - monitoring: new Wealth dashboard (wealth.json, 10 panels) — current net worth / contribution / growth / ROI% stats, then time-series for net worth, contribution-vs-market, growth area, per-account stacked area, cash-vs-invested, and a 100-row activity log. Initial sync: 6 accounts, 10,798 daily valuations, 518 activities. Verified PG totals match SQLite latest snapshot exactly.	2026-04-25 17:07:33 +00:00
Viktor Barzin	288efa89b3	vault: migrate vault-0 storage to proxmox-lvm-encrypted Phase 2 of the NFS-hostile migration: data + audit storageClass on the vault helm release switches from nfs-proxmox to proxmox-lvm-encrypted, then per-pod rolling swap (24h soak between). vault-0 swap done. vault-1 + vault-2 still on NFS — the rolling part is what makes this safe (raft quorum maintained by 2 healthy pods while one is replaced). Also restores chart-default pod securityContext fields. The previous `statefulSet.securityContext.pod = {fsGroupChangePolicy = "..."}` block REPLACED (not merged) the chart's defaults — fsGroup, runAsGroup, runAsUser, runAsNonRoot were all silently dropped. NFS exports were permissive enough to mask the missing fsGroup; ext4 LV volume root is root:root and the vault user (UID 100) couldn't open vault.db, CrashLoopBackOff. Fix: provide all five fields explicitly, survives future chart bumps. vault-1 and vault-2 retained their correct securityContext from when their pod specs were written to etcd, before the partial customization landed — the bug only surfaces when a pod is recreated. Pre-flight raft snapshot saved at /tmp/vault-pre-migration-*.snap (recovery anchor). Refs: code-gy7h Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-25 16:19:49 +00:00
Viktor Barzin	2f1f9107f8	vault: add fsGroupChangePolicy=OnRootMismatch + 2026-04-22 post-mortem The 2026-04-22 Vault outage caught kubelet in a 2-minute chown loop that never exited because the default fsGroupChangePolicy (Always) walks every file on the NFS-backed data PVC. With retrans=3,timeo=30 NFS options and a 1GB audit log, the recursive chown outlasted the deadline and restarted forever — blocking raft quorum recovery. OnRootMismatch makes chown a no-op when the volume root is already correct, which it always is after initial setup. The breakglass fix was applied live via kubectl patch at 10:54 UTC; this commit persists it in Terraform so the next apply doesn't revert. The post-mortem also documents the upstream raft stuck-leader pattern, NFS kernel client corruption after force-kill, and the path to migrate Vault off NFS to proxmox-lvm-encrypted.	2026-04-22 11:12:19 +00:00
Viktor Barzin	e7ce545da2	[job-hunter] Add infra stack + Grafana dashboard + n8n digest workflow New service stack at stacks/job-hunter/ mirroring the payslip-ingest pattern: per-service CNPG database + role (via dbaas null_resource), Vault static role pg-job-hunter (7d rotation), ExternalSecrets for app secrets and DB creds, Deployment with alembic-migrate init container, ClusterIP Service, Grafana datasource ConfigMap. Grafana dashboard job-hunter.json in Finance folder: new roles per day, source breakdown, top companies, GBP salary distribution, recent roles table (sorted by parse confidence then salary). n8n weekly-digest workflow calls POST /digest/generate with bearer auth every Monday 07:00 London; digest_runs table provides idempotency. Refs: code-snp Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 17:09:29 +00:00
Viktor Barzin	2eca011cc3	[ci,vault] Fix Tier-1 apply silently failing in Woodpecker ## Context For weeks, every push to infra has resulted in `build-cli` workflow failure AND `default` workflow succeed — but the `default` workflow's "success" was a lie. Inside the apply-loop we were swallowing per-stack failures with `set +e ... echo FAILED` and the step exited 0 regardless. Discovered during bd code-3o3 e2e test (qbittorrent 5.0.4 → 5.1.4): agent commit landed, CI reported `default=success`, but cluster was unchanged. Log inside the step showed: [servarr] Starting apply... ERROR: Cannot read PG credentials from Vault. Run: vault login -method=oidc [servarr] FAILED (exit 1) Two root causes, two fixes here. ### 1. Vault `ci` role lacks Tier-1 PG backend creds The Tier-1 PG state backend (2026-04-16 migration, memory 407) uses the `pg-terraform-state` static DB role. `scripts/tg` reads it via `vault read database/static-creds/pg-terraform-state`. That path is permitted by the separate `terraform-state` Vault policy, which is bound only to a role in namespace `claude-agent`. The CI runner is in namespace `woodpecker` using role `ci`, whose policy grants only KV + K8s-creds + transit. Net: every Tier-1 stack apply from CI has been dying at the PG-creds fetch since the migration. Fix: attach `vault_policy.terraform_state` to `vault_kubernetes_auth_backend_role.ci`'s `token_policies`. No new policy needed — reuses the minimal one from 2026-04-16. ### 2. Apply-loop swallows stack failures `.woodpecker/default.yml`'s platform + app apply loops use `set +e; OUTPUT=$(... tg apply ...); EXIT=$?; set -e; [ $EXIT -ne 0 ] && echo FAILED` and then continue the while-loop. The step never re-raises, so it exits 0 regardless of how many stacks failed. Fix: accumulate failed stack names (excluding lock-skipped ones) into `FAILED_PLATFORM_STACKS` / `FAILED_APP_STACKS`, serialise the platform list to `.platform_failed` so it survives the step boundary, and at the end of the app-stack step exit 1 if either list is non-empty. Lock-skipped stacks remain non-fatal. Together, (1) unblocks real apply and (2) ensures the Woodpecker pipeline + the service-upgrade agent can both trust `default` workflow state again. ## What is NOT in this change - Re-running the qbittorrent upgrade to converge the cluster — the TF file is already at 5.1.4 in git; once CI picks up this commit it'll apply on its own, or Viktor can run `tg apply` locally now that the ci role has access too. - Retiring the `set +e ... continue` pattern entirely — keeping the per-stack continuation so a single bad stack doesn't hide the others' plans from the log. Just making the final status honest. ## Test Plan ### Automated `terraform plan` / apply clean (Tier-0 via scripts/tg): ``` Plan: 0 to add, 2 to change, 0 to destroy. # vault_kubernetes_auth_backend_role.ci will be updated in-place ~ token_policies = [ + "terraform-state", # (1 unchanged element hidden) ] # vault_jwt_auth_backend.oidc will be updated in-place ~ tune = [...] # cosmetic provider-schema drift, pre-existing Apply complete! Resources: 0 added, 2 changed, 0 destroyed. ``` State re-encrypted via `scripts/state-sync encrypt vault`; enc file committed. ### Manual Verification ``` # Before (on previous commit — expect failure): $ kubectl -n woodpecker exec woodpecker-server-0 -- sh -c ' SA=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token); TOK=$(curl -s -X POST http://vault-active.vault.svc:8200/v1/auth/kubernetes/login \ -d "{\"role\":\"ci\",\"jwt\":\"$SA\"}" \| jq -r .auth.client_token); curl -s -H "X-Vault-Token: $TOK" \ http://vault-active.vault.svc:8200/v1/database/static-creds/pg-terraform-state' → {"errors":["1 error occurred:\n\t* permission denied\n\n"]} # After (this commit): → {"data":{"username":"terraform_state","password":"..."},...} ``` Pipeline-level: the next infra push will exercise `.woodpecker/default.yml`; expected first push is this very commit. Watch `ci.viktorbarzin.me` — the `default` workflow should either succeed for real (and land actual changes) or exit 1 with "=== FAILED STACKS ===" so the cause is visible. Refs: bd code-e1x Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 14:25:52 +00:00
Viktor Barzin	327ce215b9	[infra] Sweep dns_config ignore_changes across all pod-owning resources [ci skip] ## Context Wave 3A (commit `c9d221d5`) added the `# KYVERNO_LIFECYCLE_V1` marker to the 27 pre-existing `ignore_changes = [...dns_config]` sites so they could be grepped and audited. It did NOT address pod-owning resources that were simply missing the suppression entirely. Post-Wave-3A sampling (2026-04-18) found that navidrome, f1-stream, frigate, servarr, monitoring, crowdsec, and many other stacks showed perpetual `dns_config` drift every plan because their `kubernetes_deployment` / `kubernetes_stateful_set` / `kubernetes_cron_job_v1` resources had no `lifecycle {}` block at all. Root cause (same as Wave 3A): Kyverno's admission webhook stamps `dns_config { option { name = "ndots"; value = "2" } }` on every pod's `spec.template.spec.dns_config` to prevent NxDomain search-domain flooding (see `k8s-ndots-search-domain-nxdomain-flood` skill). Without `ignore_changes` on every Terraform-managed pod-owner, Terraform repeatedly tries to strip the injected field. ## This change Extends the Wave 3A convention by sweeping EVERY `kubernetes_deployment`, `kubernetes_stateful_set`, `kubernetes_daemon_set`, `kubernetes_cron_job_v1`, `kubernetes_job_v1` (+ their `_v1` variants) in the repo and ensuring each carries the right `ignore_changes` path: - kubernetes_deployment / stateful_set / daemon_set / job_v1: `spec[0].template[0].spec[0].dns_config` - kubernetes_cron_job_v1: `spec[0].job_template[0].spec[0].template[0].spec[0].dns_config` (extra `job_template[0]` nesting — the CronJob's PodTemplateSpec is one level deeper) Each injection / extension is tagged `# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2` inline so the suppression is discoverable via `rg 'KYVERNO_LIFECYCLE_V1' stacks/`. Two insertion paths are handled by a Python pass (`/tmp/add_dns_config_ignore.py`): 1. No existing `lifecycle {}`: inject a brand-new block just before the resource's closing `}`. 108 new blocks on 93 files. 2. Existing `lifecycle {}` (usually for `DRIFT_WORKAROUND: CI owns image tag` from Wave 4, commit a62b43d1): extend its `ignore_changes` list with the dns_config path. Handles both inline (`= [x]`) and multiline (`= [\n x,\n]`) forms; ensures the last pre-existing list item carries a trailing comma so the extended list is valid HCL. 34 extensions. The script skips anything already mentioning `dns_config` inside an `ignore_changes`, so re-running is a no-op. ## Scale - 142 total lifecycle injections/extensions - 93 `.tf` files touched - 108 brand-new `lifecycle {}` blocks + 34 extensions of existing ones - Every Tier 0 and Tier 1 stack with a pod-owning resource is covered - Together with Wave 3A's 27 pre-existing markers → 169 greppable `KYVERNO_LIFECYCLE_V1` dns_config sites across the repo ## What is NOT in this change - `stacks/trading-bot/main.tf` — entirely commented-out block (`/* … /`). Python script touched the file, reverted manually. - `_template/main.tf.example` skeleton — kept minimal on purpose; any future stack created from it should either inherit the Wave 3A one-line form or add its own on first `kubernetes_deployment`. - `terraform fmt` fixes to pre-existing alignment issues in meshcentral, nvidia/modules/nvidia, vault — unrelated to this commit. Left for a separate fmt-only pass. - Non-pod resources (`kubernetes_service`, `kubernetes_secret`, `kubernetes_manifest`, etc.) — they don't own pods so they don't get Kyverno dns_config mutation. ## Verification Random sample post-commit: ``` $ cd stacks/navidrome && ../../scripts/tg plan → No changes. $ cd stacks/f1-stream && ../../scripts/tg plan → No changes. $ cd stacks/frigate && ../../scripts/tg plan → No changes. $ rg -c 'KYVERNO_LIFECYCLE_V1' stacks/ --include='.tf' --include='*.tf.example' \ \| awk -F: '{s+=$2} END {print s}' 169 ``` ## Reproduce locally 1. `git pull` 2. `rg 'KYVERNO_LIFECYCLE_V1' stacks/ \| wc -l` → 169+ 3. `cd stacks/navidrome && ../../scripts/tg plan` → expect 0 drift on the deployment's dns_config field. Refs: code-seq (Wave 3B dns_config class closed; kubernetes_manifest annotation class handled separately in `8d94688d` for tls_secret) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 21:19:48 +00:00
Viktor Barzin	8b43692af0	[infra] Suppress Goldilocks vpa-update-mode label drift on all namespaces [ci skip] ## Context Wave 3B-continued: the Goldilocks VPA dashboard (stacks/vpa) runs a Kyverno ClusterPolicy `goldilocks-vpa-auto-mode` that mutates every namespace with `metadata.labels["goldilocks.fairwinds.com/vpa-update-mode"] = "off"`. This is intentional — Terraform owns container resource limits, and Goldilocks should only provide recommendations, never auto-update. The label is how Goldilocks decides per-namespace whether to run its VPA in `off` mode. Effect on Terraform: every `kubernetes_namespace` resource shows the label as pending-removal (`-> null`) on every `scripts/tg plan`. Dawarich survey 2026-04-18 confirmed the drift. Cluster-side count: 88 namespaces carry the label (`kubectl get ns -o json \| jq ... \| wc -l`). Every TF-managed namespace is affected. This commit brings the intentional admission drift under the same `# KYVERNO_LIFECYCLE_V1` discoverability marker introduced in `c9d221d5` for the ndots dns_config pattern. The marker now stands generically for any Kyverno admission-webhook drift suppression; the inline comment records which specific policy stamps which specific field so future grep audits show why each suppression exists. ## This change 107 `.tf` files touched — every stack's `resource "kubernetes_namespace"` resource gets: ```hcl lifecycle { # KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]] } ``` Injection was done with a brace-depth-tracking Python pass (`/tmp/add_goldilocks_ignore.py`): match `^resource "kubernetes_namespace" ` → track `{` / `}` until the outermost closing brace → insert the lifecycle block before the closing brace. The script is idempotent (skips any file that already mentions `goldilocks.fairwinds.com/vpa-update-mode`) so re-running is safe. Vault stack picked up 2 namespaces in the same file (k8s-users produces one, plus a second explicit ns) — confirmed via file diff (+8 lines). ## What is NOT in this change - `stacks/trading-bot/main.tf` — entire file is `/* … /` commented out (paused 2026-04-06 per user decision). Reverted after the script ran. - `stacks/_template/main.tf.example` — per-stack skeleton, intentionally minimal. User keeps it that way. Not touched by the script (file has no real `resource "kubernetes_namespace"` — only a placeholder comment). - `.terraform/` copies (e.g. `stacks/metallb/.terraform/modules/...`) — gitignored, won't commit; the live path was edited. - `terraform fmt` cleanup of adjacent pre-existing alignment issues in authentik, freedify, hermes-agent, nvidia, vault, meshcentral. Reverted to keep the commit scoped to the Goldilocks sweep. Those files will need a separate fmt-only commit or will be cleaned up on next real apply to that stack. ## Verification Dawarich (one of the hundred-plus touched stacks) showed the pattern before and after: ``` $ cd stacks/dawarich && ../../scripts/tg plan Before: Plan: 0 to add, 2 to change, 0 to destroy. # kubernetes_namespace.dawarich will be updated in-place (goldilocks.fairwinds.com/vpa-update-mode -> null) # module.tls_secret.kubernetes_secret.tls_secret will be updated in-place (Kyverno generate. labels — fixed in `8d94688d`) After: No changes. Your infrastructure matches the configuration. ``` Injection count check: ``` $ rg -c 'KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode' stacks/ \| awk -F: '{s+=$2} END {print s}' 108 ``` ## Reproduce locally 1. `git pull` 2. Pick any stack: `cd stacks/<name> && ../../scripts/tg plan` 3. Expect: no drift on the namespace's goldilocks.fairwinds.com/vpa-update-mode label. Closes: code-dwx Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 21:15:27 +00:00
Viktor Barzin	43b4e1d372	[payslip-ingest] Deploy stack + Grafana dashboard + Vault DB role ## Context New service `payslip-ingest` (code lives in `/home/wizard/code/payslip-ingest/`) needs in-cluster deployment, its own Postgres DB + rotating user, a Grafana datasource, a dashboard, and a Claude agent definition for PDF extraction. Cluster-internal only — webhook fires from Paperless-ngx in a sibling namespace. No ingress, no TLS cert, no DNS record. ## What ### New stack `stacks/payslip-ingest/` - `kubernetes_namespace` payslip-ingest, tier=aux. - ExternalSecret (vault-kv) projects PAPERLESS_API_TOKEN, CLAUDE_AGENT_BEARER_TOKEN, WEBHOOK_BEARER_TOKEN into `payslip-ingest-secrets`. - ExternalSecret (vault-database) reads rotating password from `static-creds/pg-payslip-ingest` and templates `DATABASE_URL` into `payslip-ingest-db-creds` with `reloader.stakater.com/match=true`. - Deployment: single replica, Recreate strategy (matches single-worker queue design), `wait-for postgresql.dbaas:5432` annotation, init container runs `alembic upgrade head`, main container serves FastAPI on 8080, Kyverno dns_config lifecycle ignore. - ClusterIP Service :8080. - Grafana datasource ConfigMap in `monitoring` ns (label `grafana_datasource=1`, uid `payslips-pg`) reading password from the db-creds K8s Secret. ### Grafana dashboard `uk-payslip.json` (4 panels) - Monthly gross/net/tax/NI (timeseries, currencyGBP). - YTD tax-band progression with threshold lines at £12,570 / £50,270 / £125,140. - Deductions breakdown (stacked bars). - Effective rate + take-home % (timeseries, percent). ### Vault DB role `pg-payslip-ingest` - Added to `allowed_roles` in `vault_database_secret_backend_connection.postgresql`. - New `vault_database_secret_backend_static_role.pg_payslip_ingest` (username `payslip_ingest`, 7d rotation). ### DBaaS — DB + role creation - New `null_resource.pg_payslip_ingest_db` mirrors `pg_terraform_state_db`: idempotent CREATE ROLE + CREATE DATABASE + GRANT ALL via `kubectl exec` into `pg-cluster-1`. ### Claude agent `.claude/agents/payslip-extractor.md` - Haiku-backed agent invoked by `claude-agent-service`. - Decodes base64 PDF from prompt, tries pdftotext → pypdf fallback, emits a single JSON object matching the schema to stdout. No network, no file writes outside /tmp, no markdown fences. ## Trade-offs / decisions - Own DB per service (convention), NOT a schema in a shared `app` DB as the plan initially described. The Alembic migration still creates a `payslip_ingest` schema inside the `payslip_ingest` DB for table organisation. - Paperless URL uses port 80 (the Service port), not 8000 (the pod target port). - Grafana datasource uses the primary RW user — separate `_ro` role is aspirational and not yet a pattern in this repo. - No ingress — webhook is cluster-internal; external exposure is unnecessary attack surface. - No Uptime Kuma monitor yet: the internal-monitor list is a static block in `stacks/uptime-kuma/`; will add in a follow-up tied to code-z29 (internal monitor auto-creator). ## Test Plan ### Automated ``` terraform init -backend=false && terraform validate Success! The configuration is valid. terraform fmt -check -recursive (exit 0) python3 -c "import json; json.load(open('uk-payslip.json'))" (exit 0) ``` ### Manual Verification (post-merge) Prerequisites: 1. Seed Vault: `vault kv put secret/payslip-ingest webhook_bearer_token=$(openssl rand -hex 32)`. 2. Seed Vault: `vault kv patch secret/paperless-ngx api_token=<paperless token>`. Apply: 3. `scripts/tg apply vault` → creates pg-payslip-ingest static role. 4. `scripts/tg apply dbaas` → creates payslip_ingest DB + role. 5. `cd stacks/payslip-ingest && ../../scripts/tg apply -target=kubernetes_manifest.db_external_secret` (first-apply ESO bootstrap). 6. `scripts/tg apply payslip-ingest` (full). 7. `kubectl -n payslip-ingest get pods` → Running 1/1. 8. `kubectl -n payslip-ingest port-forward svc/payslip-ingest 8080:8080 && curl localhost:8080/healthz` → 200. End-to-end: 9. Configure Paperless workflow (README in code repo has steps). 10. Upload sample payslip tagged `payslip` → row in `payslip_ingest.payslip` within 60s. 11. Grafana → Dashboards → UK Payslip → 4 panels render. Closes: code-do7 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 19:07:05 +00:00
Viktor Barzin	e80b2f026f	[infra] Migrate Terraform state from local SOPS to PostgreSQL backend Two-tier state architecture: - Tier 0 (infra, platform, cnpg, vault, dbaas, external-secrets): local state with SOPS encryption in git — unchanged, required for bootstrap. - Tier 1 (105 app stacks): PostgreSQL backend on CNPG cluster at 10.0.20.200:5432/terraform_state with native pg_advisory_lock. Motivation: multi-operator friction (every workstation needed SOPS + age + git-crypt), bootstrap complexity for new operators, and headless agents/CI needing the full encryption toolchain just to read state. Changes: - terragrunt.hcl: conditional backend (local vs pg) based on tier0 list - scripts/tg: tier detection, auto-fetch PG creds from Vault for Tier 1, skip SOPS and Vault KV locking for Tier 1 stacks - scripts/state-sync: tier-aware encrypt/decrypt (skips Tier 1) - scripts/migrate-state-to-pg: one-shot migration script (idempotent) - stacks/vault/main.tf: pg-terraform-state static role + K8s auth role for claude-agent namespace - stacks/dbaas: terraform_state DB creation + MetalLB LoadBalancer service on shared IP 10.0.20.200 - Deleted 107 .tfstate.enc files for migrated Tier 1 stacks - Cleaned up per-stack tiers.tf (now generated by root terragrunt.hcl) [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 19:33:12 +00:00
Viktor Barzin	216d4240c9	[infra] Add Cloudflare provider to all stack lock files and generated providers Terragrunt now generates cloudflare_provider.tf (Vault-sourced API key) and includes cloudflare in required_providers. These are the generated files from running `terragrunt init -upgrade` across all stacks. [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 16:31:36 +00:00
Viktor Barzin	b1d152be1f	[infra] Auto-create Cloudflare DNS records from ingress_factory ## Context Deploying new services required manually adding hostnames to cloudflare_proxied_names/cloudflare_non_proxied_names in config.tfvars — a separate file from the service stack. This was frequently forgotten, leaving services unreachable externally. ## This change: - Add `dns_type` parameter to `ingress_factory` and `reverse_proxy/factory` modules. Setting `dns_type = "proxied"` or `"non-proxied"` auto-creates the Cloudflare DNS record (CNAME to tunnel or A/AAAA to public IP). - Simplify cloudflared tunnel from 100 per-hostname rules to wildcard `*.viktorbarzin.me → Traefik`. Traefik still handles host-based routing. - Add global Cloudflare provider via terragrunt.hcl (separate cloudflare_provider.tf with Vault-sourced API key). - Migrate 118 hostnames from centralized config.tfvars to per-service dns_type. 17 hostnames remain centrally managed (Helm ingresses, special cases). - Update docs, AGENTS.md, CLAUDE.md, dns.md runbook. ``` BEFORE AFTER config.tfvars (manual list) stacks/<svc>/main.tf \| module "ingress" { v dns_type = "proxied" stacks/cloudflared/ } for_each = list \| cloudflare_record auto-creates tunnel per-hostname cloudflare_record + annotation ``` ## What is NOT in this change: - Uptime Kuma monitor migration (still reads from config.tfvars) - 17 remaining centrally-managed hostnames (Helm, special cases) - Removal of allow_overwrite (keep until migration confirmed stable) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 13:45:04 +00:00
Viktor Barzin	bcad200a23	chore: add untracked stacks, scripts, and agent configs - New stacks: beads-server, hermes-agent - Terragrunt tiers.tf for infra, phpipam, status-page - Secrets symlinks for vault, phpipam, hermes-agent - Scripts: cluster_manager, image_pull, containerd pullthrough setup - Frigate config, audiblez-web app source, n8n workflows dir - Claude agent: service-upgrade, reference: upgrade-config.json - Removed: claudeception skill, excalidraw empty submodule, temp listings [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 09:33:06 +00:00
Viktor Barzin	30cdeefb1c	chore: sync terraform state after nfsvers=4 convergence Applied all 20 NFS stacks to converge PV mount_options (nfsvers=4). State files encrypted and committed. [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 11:20:18 +00:00
Viktor Barzin	82b0f6c4cb	truenas deprecation: migrate all non-immich storage to proxmox NFS - Migrate 7 backup CronJobs to Proxmox host NFS (192.168.1.127) (etcd, mysql, postgresql, nextcloud, redis, vaultwarden, plotting-book) - Migrate headscale backup, ebook2audiobook, osm_routing to Proxmox NFS - Migrate servarr (lidarr, readarr, soulseek) NFS refs to Proxmox - Remove 79 orphaned TrueNAS NFS module declarations from 49 stacks - Delete stacks/platform/modules/ (27 dead module copies, 65MB) - Update nfs-truenas StorageClass to point to Proxmox (192.168.1.127) - Remove iscsi DNS record from config.tfvars - Fix woodpecker persistence config and alertmanager PV Only Immich (8 PVCs, ~1.4TB) remains on TrueNAS.	2026-04-12 14:35:39 +01:00
Viktor Barzin	4d3d3316ab	feat(phpipam): deploy phpIPAM for live IP address management Lightweight IPAM with auto-discovery scanning every 15min via fping. Replaces disabled NetBox (OOM'd). Uses existing MySQL InnoDB cluster with Vault-rotated credentials. Cloudflare DNS + Authentik auth. [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 14:19:25 +00:00
Viktor Barzin	9492874c43	fix: restore technitium MySQL query logging with Vault auto-rotation [ci skip] Query logs stopped syncing on 2026-03-16 due to password mismatch after MySQL cluster rebuild and Technitium app config reset. - Add Vault static role mysql-technitium (7-day rotation) - Add ExternalSecret for technitium-db-creds in technitium namespace - Add password-sync CronJob (6h) to push rotated password to Technitium API - Update Grafana datasource to use ESO-managed password - Remove stale technitium_db_password variable (replaced by ESO) - Update databases.md and restore-mysql.md runbook	2026-04-06 13:00:49 +03:00
Viktor Barzin	f80e1fa868	cluster health fixes: NFS CSI, Immich ML, dbaas, Redis, DNS, trading-bot removal - NFS CSI: fix liveness-probe port conflict (29652 → 29653) - Immich ML: add gpu-workload priority class to enable preemption on node1 - dbaas: right-size MySQL memory limits (sidecar 6Gi→350Mi, main 4Gi→3Gi) - Redis: add redis-master service via HAProxy for master-only routing, update config.tfvars redis_host to use it - CoreDNS: forward .viktorbarzin.lan to Technitium ClusterIP (10.96.0.53) instead of stale LoadBalancer IP (10.0.20.200) - Trading bot: comment out all resources (no longer needed) - Vault: remove trading-bot PostgreSQL database role	2026-04-06 11:54:45 +03:00
Viktor Barzin	70ea01fb6e	vault: increase k8s auth token TTLs and add periodic renewal Stagger token periods across roles (7d/8d/9d/10d) to prevent bulk lease revocation storms that caused transient 504s. Periodic tokens auto-renew indefinitely, eliminating mass expiry.	2026-03-26 12:21:47 +02:00
Viktor Barzin	d20c5e5535	add backup_output_bytes metric and cloudsync_transferred_bytes to backup dashboard - All 7 backup CronJobs now push backup_output_bytes (file size after backup) - Cloud Sync monitor parses rclone transfer stats into cloudsync_transferred_bytes - Grafana dashboard: new Output (MiB) table column, Output Size Trend panel, Write Throughput panel, Cloud Sync Transfer Volume bargauge - All timeseries panels use points-only draw style (discrete backup snapshots) - etcd backup restructured: init_container for etcdctl (distroless image), busybox sidecar for metrics push + purge, ClusterFirstWithHostNet DNS - Fixed pre-existing curl missing in postgres:16.4-bullseye (immich, dbaas PG) - Fixed grep -oP not available in alpine/busybox (cloud sync monitor)	2026-03-25 10:44:53 +02:00
Viktor Barzin	a95d434ff1	fix backup IO stats: use /proc/$$/io instead of /proc/self/io /proc/self/io inside $(awk ...) resolves to the awk subprocess PID, not the parent bash shell. Use $$ (bash PID) to read the correct process IO counters.	2026-03-23 12:33:52 +02:00
Viktor Barzin	0a294a30a6	add backup IO logging, Pushgateway metrics, and Grafana dashboard - Add /proc/self/io read/write tracking to vault raft-backup and etcd backup - Push backup_duration_seconds, backup_read_bytes, backup_written_bytes, backup_last_success_timestamp to Pushgateway from all 6 backup CronJobs (etcd skipped — distroless image has no wget/curl) - Add cloudsync_duration_seconds metric to cloudsync-monitor - New "Backup Health" Grafana dashboard with 8 panels: time since last backup, overview table, duration/IO trends, cloud sync status, alerts, CronJob schedule	2026-03-23 12:19:01 +02:00
Viktor Barzin	e463281205	optimize backup schedules: compress dumps, stagger to weekly, extend retention - dbaas: gzip MySQL/PostgreSQL dumps, stagger to 0:30, clean old uncompressed - infra-maintenance: etcd backup daily→weekly Sunday 1am - redis: backup hourly→weekly Sunday 3am, retention 7→28 days - vault: raft backup daily→weekly Sunday 2am	2026-03-23 02:24:34 +02:00
Viktor Barzin	e823b795f7	fix(dbaas,vault): fix backup CronJob failures and mysql-operator memory - Add docker.io/library/ prefix to mysql and postgres backup images to satisfy Kyverno require-trusted-registries policy (both CronJobs were blocked for 46h, triggering MySQLBackupStale alert) - Document mysql-operator chart ignoring resources values key — the LimitRange default (256Mi) was silently applied, putting the operator at 97% memory. Patched live to 512Mi via kubectl. - Increase vault-raft-backup backoff_limit to 6 for transient failures (also fixed NFS export: vault-backup was a separate ZFS dataset not in the TrueNAS NFS share — destroyed dataset, created directory)	2026-03-19 23:26:05 +00:00
Viktor Barzin	21bb3036af	state(dbaas): update encrypted state	2026-03-19 20:23:59 +00:00
Viktor Barzin	fd130971aa	feat(provision): automated user provisioning via Authentik webhook - Expand CI Vault policy: write secret/data/platform + Transit SOPS keys - Add Woodpecker provision-user.yml pipeline (manual event, API-triggered) - Add env vars to webhook-handler deployment for Woodpecker/Authentik integration - Update add-user skill with automated flow documentation - Update Woodpecker repo ID list in CLAUDE.md	2026-03-17 23:56:30 +00:00
Viktor Barzin	ccbcebb670	feat(vault): automate SOPS onboarding for namespace-owners - Add Transit mount + per-stack Transit keys to vault stack TF - Auto-create sops-user-<name> policy scoping decrypt to owned stacks - Auto-create sops-<name> external group + alias for Authentik mapping - Add sops-admin policy to authentik-admins group - Attach sops-user policy to namespace-owner identity entities - Update add-user skill with SOPS onboarding steps and Authentik group - Adding a user to k8s_users + applying vault stack = full SOPS access [ci skip]	2026-03-17 23:15:25 +00:00
Viktor Barzin	8d8c8db737	increase DB password rotation from 24h to weekly (604800s)	2026-03-16 23:17:01 +00:00
Viktor Barzin	50620e6047	add generic multi-user cluster onboarding system Data-driven user onboarding: add a JSON entry to Vault KV k8s_users, apply vault + platform + woodpecker stacks, and everything is auto-generated. Vault stack: namespace creation, per-user Vault policies with secret isolation via identity entities/aliases, K8s deployer roles, CI policy update. Platform stack: domains field in k8s_users type, TLS secrets per user namespace, user domains merged into Cloudflare DNS, user-roles ConfigMap mounted in portal. Woodpecker stack: admin list auto-generated from k8s_users, WOODPECKER_OPEN=true. K8s-portal: dual-track onboarding (general/namespace-owner), namespace-owner dashboard with Vault/kubectl commands, setup script adds Vault+Terraform+Terragrunt, contributing page with CI pipeline template, versioned image tags in CI pipeline. New: stacks/_template/ with copyable stack template for namespace-owners.	2026-03-15 22:23:36 +00:00
Viktor Barzin	06a0d0599a	regenerate providers.tf: remove vault_root_token variable [ci skip]	2026-03-15 21:21:01 +00:00
Viktor Barzin	3aba29e7a3	remove SOPS pipeline, deploy ESO + Vault DB/K8s engines Vault is now the sole source of truth for secrets. SOPS pipeline removed entirely — auth via `vault login -method=oidc`. Part A: SOPS removal - vault/main.tf: delete 990 lines (93 vars + 43 KV write resources), add self-read data source for OIDC creds from secret/vault - terragrunt.hcl: remove SOPS var loading, vault_root_token, check_secrets hook - scripts/tg: remove SOPS decryption, keep -auto-approve logic - .woodpecker/default.yml: replace SOPS with Vault K8s auth via curl - Delete secrets.sops.json, .sops.yaml Part B: External Secrets Operator - New stack stacks/external-secrets/ with Helm chart + 2 ClusterSecretStores (vault-kv for KV v2, vault-database for DB engine) Part C: Database secrets engine (in vault/main.tf) - MySQL + PostgreSQL connections with static role rotation (24h) - 6 MySQL roles (speedtest, wrongmove, codimd, nextcloud, shlink, grafana) - 6 PostgreSQL roles (trading, health, linkwarden, affine, woodpecker, claude_memory) Part D: Kubernetes secrets engine (in vault/main.tf) - RBAC for Vault SA to manage K8s tokens - Roles: dashboard-admin, ci-deployer, openclaw, local-admin - New scripts/vault-kubeconfig helper for dynamic kubeconfig K8s auth method with scoped policies for CI, ESO, OpenClaw, Woodpecker sync.	2026-03-15 16:37:38 +00:00
Viktor Barzin	23019da8e5	equalize memory req=lim across 70+ containers using Prometheus 7d max data After node2 OOM incident, right-size memory across the cluster by setting requests=limits based on max_over_time(container_memory_working_set_bytes[7d]) with 1.3x headroom. Eliminates ~37Gi overcommit gap. Categories: - Safe equalization (50 containers): set req=lim where max7d well within target - Limit increases (8 containers): raise limits for services spiking above current - No Prometheus data (12 containers): conservatively set lim=req - Exception: nextcloud keeps req=256Mi/lim=8Gi due to Apache memory spikes Also increased dbaas namespace quota from 12Gi to 16Gi to accommodate mysql 4Gi limits across 3 replicas.	2026-03-14 21:46:49 +00:00
Viktor Barzin	f7c2c06009	right-size memory: set requests=limits based on actual usage - Set memory requests = limits across 56 stacks to prevent overcommit - Right-sized limits based on actual pod usage (2x actual, rounded up) - Scaled down trading-bot (replicas=0) to free memory - Fixed OOMKilled services: forgejo, dawarich, health, meshcentral, paperless-ngx, vault auto-unseal, rybbit, whisper, openclaw, clickhouse - Added startup+liveness probes to calibre-web - Bumped inotify limits on nodes 2,3 (max_user_instances 128->8192) Post node2 OOM incident (2026-03-14). Previous kubelet config had no kubeReserved/systemReserved set, allowing pods to starve the kernel.	2026-03-14 21:01:24 +00:00
Viktor Barzin	98d7c2a4a5	fix: resolve HCL semicolons and vault-platform dependency cycle - Replace semicolons with newlines in vault/main.tf variable blocks (HCL does not support semicolons) - Remove dependency "vault" from platform/terragrunt.hcl to break cycle (vault already depends on platform)	2026-03-14 17:37:25 +00:00
Viktor Barzin	a8d944eb9b	migrate all secrets from SOPS to Vault KV - Add vault provider to root terragrunt.hcl (generated providers.tf) - Delete stacks/vault/vault_provider.tf (now in generated providers.tf) - Add 124 variable declarations + 43 vault_kv_secret_v2 resources to vault/main.tf to populate Vault KV at secret/<stack-name> - Migrate 43 consuming stacks to read secrets from Vault KV via data "vault_kv_secret_v2" instead of SOPS var-file - Add dependency "vault" to all migrated stacks' terragrunt.hcl - Complex types (maps/lists) stored as JSON strings, decoded with jsondecode() in locals blocks Bootstrap secrets (vault_root_token, vault_authentik_client_id, vault_authentik_client_secret) remain in SOPS permanently. Apply order: vault stack first (populates KV), then all others.	2026-03-14 17:15:48 +00:00
Viktor Barzin	27fa8ea18f	Hide Vault OIDC from main login dropdown OIDC popup flow hangs due to Authentik X-Frame-Options. Keep OIDC accessible via the "Other" tab instead.	2026-03-14 14:12:16 +00:00
Viktor Barzin	1dec7e6bea	Add Vault OIDC authentication via Authentik Configure Vault to use Authentik as OIDC identity provider for SSO login. Creates OAuth2 provider/application in Authentik, adds OIDC auth backend, admin policy, and maps "authentik Admins" group to full vault-admin access.	2026-03-14 13:53:05 +00:00

38 commits