infra

Author	SHA1	Message	Date
Viktor Barzin	89af09852f	feat(ci): add Vault advisory locks to CI terraform applies CI now uses scripts/tg instead of raw terragrunt apply, acquiring the same per-stack Vault KV lock that user sessions use. This prevents CI from overwriting in-flight user applies. Changes: - Switch from xargs -P 4 (parallel) to serial while-read loop - CI skips stacks locked by users instead of racing them - Git rebase failures now exit 1 instead of silently continuing - Updated header comments to reflect new locking behavior Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 20:53:00 +00:00
Viktor Barzin	601a83d84e	fix: CI pipeline image pull auth + shallow clone resilience [ci skip] - Add WOODPECKER_BACKEND_K8S_PULL_SECRET_NAMES to agent env so step pods can pull from private registry (registry.viktorbarzin.me:5050) - Add fallback in default.yml when HEAD~1 is unavailable (shallow clone with depth=1): fetch more history, or apply all platform stacks as safe default - Root cause: pipeline #243 failed because infra-ci:latest image couldn't be pulled (no imagePullSecrets on step pods) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 14:41:08 +00:00
Viktor Barzin	36454b87d1	feat: CI/CD performance overhaul - New custom CI Docker image (ci/Dockerfile) with TF 1.5.7, TG 0.99.4, git-crypt, sops, kubectl pre-installed. Pushed to private registry. Eliminates 17 apk add calls + binary downloads per pipeline run. - Unified CI pipeline: merge default.yml + app-stacks.yml into one. Changed-stacks-only detection (git diff, with global-file fallback). Concurrency limit (xargs -P 4). Step consolidation (2 steps vs 4). Shallow clone (depth=2). Provider cache (TF_PLUGIN_CACHE_DIR). - Per-stack Vault advisory locks in scripts/tg. 30min TTL with stale lock detection. Blocks concurrent applies to same stack. - TF_PLUGIN_CACHE_DIR enabled by default in scripts/tg for local dev. - Daily drift detection pipeline (.woodpecker/drift-detection.yml). Runs terraform plan on all stacks, Slack alert on drift. - CI image build pipeline (.woodpecker/build-ci-image.yml). Expected speedup: ~5-10 min per pipeline run → ~2-4 min. [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 11:22:26 +00:00
Viktor Barzin	73511b1230	extract remaining 19 modules from platform, complete stack split [ci skip] Phase 3: all 27 platform modules now run as independent stacks. Platform reduced to empty shell (outputs only) for backward compat with 72 app stacks that declare dependency "platform". Fixed technitium cross-module dashboard reference by copying file. Woodpecker pipeline applies all 27+1 stacks in parallel via loop. All applied with zero destroys.	2026-03-17 21:42:16 +00:00
Viktor Barzin	ae36dc253b	extract monitoring, nvidia, mailserver, cloudflared, kyverno from platform [ci skip] Phase 2 of platform stack split. 5 more modules extracted into independent stacks. All applied successfully with zero destroys. Cloudflared now reads k8s_users from Vault directly to compute user_domains. Woodpecker pipeline runs all 8 extracted stacks in parallel. Memory bumped to 6Gi for 9 concurrent TF processes. Platform reduced from 27 to 19 modules.	2026-03-17 21:34:11 +00:00
Viktor Barzin	3c804aedf8	extract dbaas, authentik, crowdsec from platform into independent stacks [ci skip] Phase 1 of platform stack split for parallel CI applies. All 3 modules were fully independent (no cross-module refs). State migrated via terraform state mv. All 3 stacks applied with zero changes (dbaas had pre-existing ResourceQuota drift). Woodpecker pipeline updated to run extracted stacks in parallel.	2026-03-17 18:11:53 +00:00
Viktor Barzin	b6d619e5df	fix: increase terragrunt-apply step memory to 2Gi LimitRange defaults containers to 192Mi which is insufficient for terragrunt apply on the platform stack (48 vault refs, many modules). Set explicit 1Gi request / 2Gi limit via backend_options.	2026-03-15 22:59:34 +00:00
Viktor Barzin	0c1239030d	fix: CI pipeline - disable corrupted cache, add pull before push - build-cli.yml: comment out cache_from/cache_to to avoid BuildKit "short read" errors from corrupted registry cache - default.yml: add git pull --rebase before push in cleanup-and-push to handle remote having newer commits	2026-03-15 22:51:08 +00:00
Viktor Barzin	3aba29e7a3	remove SOPS pipeline, deploy ESO + Vault DB/K8s engines Vault is now the sole source of truth for secrets. SOPS pipeline removed entirely — auth via `vault login -method=oidc`. Part A: SOPS removal - vault/main.tf: delete 990 lines (93 vars + 43 KV write resources), add self-read data source for OIDC creds from secret/vault - terragrunt.hcl: remove SOPS var loading, vault_root_token, check_secrets hook - scripts/tg: remove SOPS decryption, keep -auto-approve logic - .woodpecker/default.yml: replace SOPS with Vault K8s auth via curl - Delete secrets.sops.json, .sops.yaml Part B: External Secrets Operator - New stack stacks/external-secrets/ with Helm chart + 2 ClusterSecretStores (vault-kv for KV v2, vault-database for DB engine) Part C: Database secrets engine (in vault/main.tf) - MySQL + PostgreSQL connections with static role rotation (24h) - 6 MySQL roles (speedtest, wrongmove, codimd, nextcloud, shlink, grafana) - 6 PostgreSQL roles (trading, health, linkwarden, affine, woodpecker, claude_memory) Part D: Kubernetes secrets engine (in vault/main.tf) - RBAC for Vault SA to manage K8s tokens - Roles: dashboard-admin, ci-deployer, openclaw, local-admin - New scripts/vault-kubeconfig helper for dynamic kubeconfig K8s auth method with scoped policies for CI, ESO, OpenClaw, Woodpecker sync.	2026-03-15 16:37:38 +00:00
Viktor Barzin	1f2c1ca361	[ci skip] phase 5+6: update CI pipelines for SOPS, add sensitive=true to secret vars Phase 5 — CI pipelines: - default.yml: add SOPS decrypt in prepare step, change git add . to specific paths (stacks/ state/ .woodpecker/), cleanup on success+failure - renew-tls.yml: change git add . to git add secrets/ state/ Phase 6 — sensitive=true: - Add sensitive = true to 256 variable declarations across 149 stack files - Prevents secret values from appearing in terraform plan output - Does NOT modify shared modules (ingress_factory, nfs_volume) to avoid breaking module interface contracts Note: CI pipeline SOPS decryption requires sops_age_key Woodpecker secret to be created before the pipeline will work with SOPS. Until then, the old terraform.tfvars path continues to function.	2026-03-07 14:30:36 +00:00
Viktor Barzin	be47592e08	fix: remove deprecated secrets field from slack step	2026-02-28 18:32:10 +00:00
Viktor Barzin	4b6ade7b08	fix: replace removed woodpeckerci/plugin-slack with curl-based webhook	2026-02-28 18:25:23 +00:00
Viktor Barzin	ebecaaee5c	Woodpecker CI: use built-in clone, fix CoreDNS DNS resolution [CI SKIP] - Switch from custom clone override to woodpeckerci/plugin-git built-in clone (handles auth automatically via netrc from GitHub OAuth token) - Add 8.8.8.8 and 1.1.1.1 as CoreDNS upstream resolvers alongside pfSense (fixes intermittent DNS timeouts causing clone failures) - Fix missing comma after heredoc in audit-policy.tf (syntax error)	2026-02-23 00:08:42 +00:00
Viktor Barzin	cbf041bcc9	[ci skip] Add Woodpecker CI stack (WIP) and claude agents - Add stacks/woodpecker/ with Helm-based deployment config - Add .woodpecker/ CI pipeline configs (default, build-cli, renew-tls) - Add NFS export entry for woodpecker - Add .claude/agents/ definitions	2026-02-22 21:30:25 +00:00

14 commits
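
Note on the locking behavior in 36454b87d1 and 89af09852f: those commits describe per-stack advisory locks with a 30min TTL, stale lock detection, and a serial CI loop that skips stacks locked by users. The following is a minimal Python sketch of that idea, not the actual scripts/tg implementation. The in-memory dict stands in for the Vault KV store, and every name in it (StackLocks, apply_serially, the holder strings) is hypothetical.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Tuple

LOCK_TTL_SECONDS = 30 * 60  # 30min TTL, as stated in commit 36454b87d1


@dataclass
class Lock:
    holder: str
    acquired_at: float


class StackLocks:
    """Advisory per-stack locks.

    Assumption: the dict below stands in for a Vault KV path; a real
    version would need an atomic check-and-set on the write.
    """

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._kv: Dict[str, Lock] = {}
        self._clock = clock

    def acquire(self, stack: str, holder: str) -> bool:
        now = self._clock()
        current = self._kv.get(stack)
        if current is not None and current.holder != holder:
            if now - current.acquired_at < LOCK_TTL_SECONDS:
                return False  # live lock held by someone else
            # Otherwise the lock is stale (older than the TTL): take it over.
        self._kv[stack] = Lock(holder, now)
        return True

    def release(self, stack: str, holder: str) -> None:
        current = self._kv.get(stack)
        if current is not None and current.holder == holder:
            del self._kv[stack]


def apply_serially(
    stacks: Iterable[str],
    locks: StackLocks,
    holder: str,
    apply: Callable[[str], None],
) -> Tuple[List[str], List[str]]:
    """Apply stacks one at a time, skipping any stack locked by another holder."""
    applied: List[str] = []
    skipped: List[str] = []
    for stack in stacks:
        if not locks.acquire(stack, holder):
            skipped.append(stack)  # skip instead of racing the lock holder
            continue
        try:
            apply(stack)
            applied.append(stack)
        finally:
            locks.release(stack, holder)  # release even if the apply fails
    return applied, skipped
```

In this sketch a stack locked by a user session is reported as skipped rather than applied, and a lock older than the TTL is treated as stale and taken over. The clock is injectable only so that expiry can be exercised without waiting 30 minutes.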