infra/.woodpecker/default.yml

271 lines
11 KiB
YAML
Raw Normal View History

# Unified infra CI pipeline — detects changed stacks and applies only those.
# Platform stacks and app stacks handled in one pipeline with proper ordering.
#
# Optimizations over the previous split pipeline:
# - Custom CI image (no apk/wget per step)
# - Shallow clone (depth=2 for git diff HEAD~1)
# - TF_PLUGIN_CACHE_DIR (shared provider cache)
# - Serial apply with Vault advisory locks (prevents user/CI race conditions)
# - Step consolidation (2 steps instead of 4)
# - Changed-stacks-only detection (skips no-op applies)
# - Global-file fallback (modules/config changes trigger full apply)
# - Lock-aware: skips stacks locked by users instead of failing
when:
event: push
branch: master
clone:
git:
image: woodpeckerci/plugin-git
settings:
depth: 2
attempts: 5
backoff: 10s
steps:
- name: apply
[forgejo] Phases 3+4+5: cutover, decommission, docs sweep End of forgejo-registry-consolidation. After Phase 0/1 already landed (Forgejo ready, dual-push CI, integrity probe, retention CronJob, images migrated via forgejo-migrate-orphan-images.sh), this commit flips everything off registry.viktorbarzin.me onto Forgejo and removes the legacy infrastructure. Phase 3 — image= flips: * infra/stacks/{payslip-ingest,job-hunter,claude-agent-service, fire-planner,freedify/factory,chrome-service,beads-server}/main.tf — image= now points to forgejo.viktorbarzin.me/viktor/<name>. * infra/stacks/claude-memory/main.tf — also moved off DockerHub (viktorbarzin/claude-memory-mcp:17 → forgejo.viktorbarzin.me/viktor/...). * infra/.woodpecker/{default,drift-detection}.yml — infra-ci pulled from Forgejo. build-ci-image.yml dual-pushes still until next build cycle confirms Forgejo as canonical. * /home/wizard/code/CLAUDE.md — claude-memory-mcp install URL updated. Phase 4 — decommission registry-private: * registry-credentials Secret: dropped registry.viktorbarzin.me / registry.viktorbarzin.me:5050 / 10.0.20.10:5050 auths entries. Forgejo entry is the only one left. * infra/stacks/infra/main.tf cloud-init: dropped containerd hosts.toml entries for registry.viktorbarzin.me + 10.0.20.10:5050. (Existing nodes already had the file removed manually by `setup-forgejo-containerd-mirror.sh` rollout — the cloud-init template only fires on new VM provision.) * infra/modules/docker-registry/docker-compose.yml: registry-private service block removed; nginx 5050 port mapping dropped. Pull- through caches for upstream registries (5000/5010/5020/5030/5040) stay on the VM permanently. * infra/modules/docker-registry/nginx_registry.conf: upstream `private` block + port 5050 server block removed. * infra/stacks/monitoring/modules/monitoring/main.tf: registry_ integrity_probe + registry_probe_credentials resources stripped. forgejo_integrity_probe is the only manifest probe now. Phase 5 — final docs sweep: * infra/docs/runbooks/registry-vm.md — VM scope reduced to pull- through caches; forgejo-registry-breakglass.md cross-ref added. * infra/docs/architecture/ci-cd.md — registry component table + diagram now reflect Forgejo. Pre-migration root-cause sentence preserved as historical context with a pointer to the design doc. * infra/docs/architecture/monitoring.md — Registry Integrity Probe row updated to point at the Forgejo probe. * infra/.claude/CLAUDE.md — Private registry section rewritten end- to-end (auth, retention, integrity, where the bake came from). * prometheus_chart_values.tpl — RegistryManifestIntegrityFailure alert annotation simplified now that only one registry is in scope. Operational follow-up (cannot be done from a TF apply): 1. ssh root@10.0.20.10 — edit /opt/registry/docker-compose.yml to match the new template AND `docker compose up -d --remove-orphans` to actually stop the registry-private container. Memory id=1078 confirms cloud-init won't redeploy on TF apply alone. 2. After 1 week of no incidents, `rm -rf /opt/registry/data/private/` on the VM (~2.6GB freed). 3. Open the dual-push step in build-ci-image.yml and drop registry.viktorbarzin.me:5050 from the `repo:` list — at that point the post-push integrity check at line 33-107 also needs to be repointed at Forgejo or removed (the per-build verify is redundant with the every-15min Forgejo probe). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-07 18:30:02 +00:00
image: forgejo.viktorbarzin.me/viktor/infra-ci:latest
pull: true
backend_options:
kubernetes:
resources:
requests:
memory: 3Gi
limits:
memory: 6Gi
environment:
SLACK_WEBHOOK:
from_secret: slack_webhook
[ci] Persist VAULT_TOKEN across Woodpecker step commands ## Context Follow-up to commit 2eca011c (bd code-e1x). That commit attached the `terraform-state` policy to the `ci` Vault role and propagated apply- loop failures so the pipeline actually fails when a stack fails. On the very first push to exercise it (pipeline 361), the platform apply step died with: [vault] Starting apply... state-sync: ERROR — no Vault token and no age key at ~/.config/sops/age/keys.txt [vault] FAILED (exit 1) Root cause: in Woodpecker's `commands:` list, each `- |` item runs in a fresh shell. The dedicated "Vault auth" command was doing `export VAULT_TOKEN=...`, but that export was lost by the time the apply command ran. Tier-0 stacks depended on Vault Transit (via `scripts/state-sync`), and Tier-1 stacks depend on `vault read database/static-creds/pg-terraform-state` via `scripts/tg` — both silently fell through to their "no Vault" error path. This bug was latent before 2eca011c because the old apply loop swallowed per-stack exit codes. Now that we surface them, the pipeline fails honestly — but fails on every run. Fixing the missing token propagation is the last mile. ## This change - Pin `VAULT_ADDR` at the step's `environment:` level so every command inherits it without an explicit export. - In the Vault auth command, assert the auth succeeded (non-empty, non-"null" token) then write the token to `~/.vault-token` with `umask 077`. `vault`, `scripts/tg`, and `scripts/state-sync` all fall through to `~/.vault-token` when `VAULT_TOKEN` env is unset. ## What is NOT in this change - A broader refactor to fold the multi-step chain into a single `- |` script — preserving the existing granular structure keeps individual step logs grep-friendly and failures localised. - Restoring the VAULT_TOKEN export too — redundant once ~/.vault-token is written, and would need duplicating into each command anyway. ## Test Plan ### Automated N/A (pure YAML change). Will be verified by the very next CI run — the push creating this commit. ### Manual Verification Watch `ci.viktorbarzin.me/repos/1/pipelines` for the pipeline whose commit matches this one. Expected: - `default` workflow exercises the auth + apply steps. - Platform apply for `vault` stack runs state-sync decrypt → detects no drift (I applied locally already) → OK. - Tier-1 stacks (if any in the diff): `vault read database/static- creds/pg-terraform-state` returns creds → apply runs. - No "state-sync: ERROR" or "Cannot read PG credentials" errors. - `default` workflow state: success. - Overall pipeline status: still failure because `build-cli` is independently broken (bd code-12b); that's cosmetic. Refs: bd code-e1x Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 14:30:39 +00:00
# Each `- |` command runs in a fresh shell, so we can't rely on an
# `export VAULT_ADDR=...` in the auth command persisting — pin it at
# step level. VAULT_TOKEN is still per-command; we persist it to
# ~/.vault-token (auto-read by `vault` CLI) so downstream commands
# don't need explicit token propagation.
VAULT_ADDR: http://vault-active.vault.svc.cluster.local:8200
commands:
# ── Skip CI commits ──
- |
if echo "$CI_COMMIT_MESSAGE" | grep -q '\[CI SKIP\]\|\[ci skip\]'; then
echo "Commit has [CI SKIP], exiting"
exit 0
fi
# ── git-crypt unlock ──
- |
SA_TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
curl -sk "https://10.0.20.100:6443/api/v1/namespaces/woodpecker/configmaps/git-crypt-key" \
-H "Authorization:Bearer $SA_TOKEN" | jq -r .data.key | base64 -d > /tmp/key
git-crypt unlock /tmp/key && rm /tmp/key
# ── Vault auth ──
- |
SA_TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
[ci] Persist VAULT_TOKEN across Woodpecker step commands ## Context Follow-up to commit 2eca011c (bd code-e1x). That commit attached the `terraform-state` policy to the `ci` Vault role and propagated apply- loop failures so the pipeline actually fails when a stack fails. On the very first push to exercise it (pipeline 361), the platform apply step died with: [vault] Starting apply... state-sync: ERROR — no Vault token and no age key at ~/.config/sops/age/keys.txt [vault] FAILED (exit 1) Root cause: in Woodpecker's `commands:` list, each `- |` item runs in a fresh shell. The dedicated "Vault auth" command was doing `export VAULT_TOKEN=...`, but that export was lost by the time the apply command ran. Tier-0 stacks depended on Vault Transit (via `scripts/state-sync`), and Tier-1 stacks depend on `vault read database/static-creds/pg-terraform-state` via `scripts/tg` — both silently fell through to their "no Vault" error path. This bug was latent before 2eca011c because the old apply loop swallowed per-stack exit codes. Now that we surface them, the pipeline fails honestly — but fails on every run. Fixing the missing token propagation is the last mile. ## This change - Pin `VAULT_ADDR` at the step's `environment:` level so every command inherits it without an explicit export. - In the Vault auth command, assert the auth succeeded (non-empty, non-"null" token) then write the token to `~/.vault-token` with `umask 077`. `vault`, `scripts/tg`, and `scripts/state-sync` all fall through to `~/.vault-token` when `VAULT_TOKEN` env is unset. ## What is NOT in this change - A broader refactor to fold the multi-step chain into a single `- |` script — preserving the existing granular structure keeps individual step logs grep-friendly and failures localised. - Restoring the VAULT_TOKEN export too — redundant once ~/.vault-token is written, and would need duplicating into each command anyway. ## Test Plan ### Automated N/A (pure YAML change). Will be verified by the very next CI run — the push creating this commit. ### Manual Verification Watch `ci.viktorbarzin.me/repos/1/pipelines` for the pipeline whose commit matches this one. Expected: - `default` workflow exercises the auth + apply steps. - Platform apply for `vault` stack runs state-sync decrypt → detects no drift (I applied locally already) → OK. - Tier-1 stacks (if any in the diff): `vault read database/static- creds/pg-terraform-state` returns creds → apply runs. - No "state-sync: ERROR" or "Cannot read PG credentials" errors. - `default` workflow state: success. - Overall pipeline status: still failure because `build-cli` is independently broken (bd code-12b); that's cosmetic. Refs: bd code-e1x Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 14:30:39 +00:00
VAULT_TOKEN=$(curl -s -X POST "$VAULT_ADDR/v1/auth/kubernetes/login" \
-d "{\"role\":\"ci\",\"jwt\":\"$SA_TOKEN\"}" | jq -r .auth.client_token)
[ci] Persist VAULT_TOKEN across Woodpecker step commands ## Context Follow-up to commit 2eca011c (bd code-e1x). That commit attached the `terraform-state` policy to the `ci` Vault role and propagated apply- loop failures so the pipeline actually fails when a stack fails. On the very first push to exercise it (pipeline 361), the platform apply step died with: [vault] Starting apply... state-sync: ERROR — no Vault token and no age key at ~/.config/sops/age/keys.txt [vault] FAILED (exit 1) Root cause: in Woodpecker's `commands:` list, each `- |` item runs in a fresh shell. The dedicated "Vault auth" command was doing `export VAULT_TOKEN=...`, but that export was lost by the time the apply command ran. Tier-0 stacks depended on Vault Transit (via `scripts/state-sync`), and Tier-1 stacks depend on `vault read database/static-creds/pg-terraform-state` via `scripts/tg` — both silently fell through to their "no Vault" error path. This bug was latent before 2eca011c because the old apply loop swallowed per-stack exit codes. Now that we surface them, the pipeline fails honestly — but fails on every run. Fixing the missing token propagation is the last mile. ## This change - Pin `VAULT_ADDR` at the step's `environment:` level so every command inherits it without an explicit export. - In the Vault auth command, assert the auth succeeded (non-empty, non-"null" token) then write the token to `~/.vault-token` with `umask 077`. `vault`, `scripts/tg`, and `scripts/state-sync` all fall through to `~/.vault-token` when `VAULT_TOKEN` env is unset. ## What is NOT in this change - A broader refactor to fold the multi-step chain into a single `- |` script — preserving the existing granular structure keeps individual step logs grep-friendly and failures localised. - Restoring the VAULT_TOKEN export too — redundant once ~/.vault-token is written, and would need duplicating into each command anyway. ## Test Plan ### Automated N/A (pure YAML change). Will be verified by the very next CI run — the push creating this commit. ### Manual Verification Watch `ci.viktorbarzin.me/repos/1/pipelines` for the pipeline whose commit matches this one. Expected: - `default` workflow exercises the auth + apply steps. - Platform apply for `vault` stack runs state-sync decrypt → detects no drift (I applied locally already) → OK. - Tier-1 stacks (if any in the diff): `vault read database/static- creds/pg-terraform-state` returns creds → apply runs. - No "state-sync: ERROR" or "Cannot read PG credentials" errors. - `default` workflow state: success. - Overall pipeline status: still failure because `build-cli` is independently broken (bd code-12b); that's cosmetic. Refs: bd code-e1x Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 14:30:39 +00:00
if [ -z "$VAULT_TOKEN" ] || [ "$VAULT_TOKEN" = "null" ]; then
echo "ERROR: Vault K8s auth failed (role=ci, ns=woodpecker)" >&2
exit 1
fi
# Persist for downstream `- |` blocks (each runs in a fresh shell,
# so exporting VAULT_TOKEN wouldn't help). `vault`, `scripts/tg`,
# and `scripts/state-sync` all fall through to ~/.vault-token when
# the env var is unset.
umask 077; printf '%s' "$VAULT_TOKEN" > "$HOME/.vault-token"
# ── Generate kubeconfig from projected SA token ──
# terragrunt.hcl injects `-var kube_config_path=<repo>/config` for every
# terraform invocation, so we need a kubeconfig file at that path. The
# `default` SA in the woodpecker namespace is cluster-admin (via the
# `woodpecker-default` ClusterRoleBinding), so the projected token is
# sufficient to apply any stack. Using `tokenFile` (not an inline token)
# so the provider re-reads it if kubelet rotates the projected token
# mid-pipeline.
- |
cat > config <<'EOF'
apiVersion: v1
kind: Config
clusters:
- name: kubernetes
cluster:
server: https://10.0.20.100:6443
certificate-authority: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
contexts:
- name: ci
context:
cluster: kubernetes
user: ci
current-context: ci
users:
- name: ci
user:
tokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
EOF
chmod 600 config
# Sanity check: kubeconfig works
kubectl --kubeconfig=config get ns kube-system -o name >/dev/null
# ── Detect changed stacks ──
- |
PLATFORM_STACKS="dbaas authentik crowdsec monitoring nvidia mailserver cloudflared kyverno metallb redis traefik technitium headscale rbac k8s-portal vaultwarden reverse-proxy metrics-server vpa nfs-csi iscsi-csi cnpg sealed-secrets uptime-kuma wireguard xray infra-maintenance platform vault reloader descheduler external-secrets"
# Ensure we have enough history for diff (clone may be shallow)
if ! git rev-parse HEAD~1 >/dev/null 2>&1; then
echo "WARNING: HEAD~1 not available (shallow clone?) — fetching more history"
git fetch --deepen=1 origin master 2>/dev/null || true
fi
# If still no parent, apply all platform stacks as a safe fallback
if ! git rev-parse HEAD~1 >/dev/null 2>&1; then
echo "Cannot determine changed files — applying ALL platform stacks"
echo "$PLATFORM_STACKS" | tr ' ' '\n' > .platform_apply
> .app_apply
else
# Check if global files changed (triggers full platform apply)
GLOBAL_CHANGED=$(git diff --name-only HEAD~1 HEAD | grep -E '^(modules/|config\.tfvars|terragrunt\.hcl)' || true)
if [ -n "$GLOBAL_CHANGED" ]; then
echo "Global files changed — applying ALL platform stacks"
echo "$PLATFORM_STACKS" | tr ' ' '\n' > .platform_apply
else
# Detect platform stacks that changed
git diff --name-only HEAD~1 HEAD | grep '^stacks/' | cut -d/ -f2 | sort -u > .all_changed
> .platform_apply
while read -r stack; do
if echo "$PLATFORM_STACKS" | grep -qw "$stack"; then
echo "$stack" >> .platform_apply
fi
done < .all_changed
fi
# Detect app stacks that changed
> .app_apply
git diff --name-only HEAD~1 HEAD | grep '^stacks/' | cut -d/ -f2 | sort -u | while read -r stack; do
if echo "$PLATFORM_STACKS" | grep -qw "$stack"; then
continue # Skip platform stacks
fi
if [ ! -f "stacks/$stack/terragrunt.hcl" ]; then
continue # Skip non-terragrunt dirs
fi
echo "$stack" >> .app_apply
done
fi
PLATFORM_COUNT=$(wc -l < .platform_apply | tr -d ' ')
APP_COUNT=$(wc -l < .app_apply | tr -d ' ')
echo "Platform stacks to apply: $PLATFORM_COUNT"
echo "App stacks to apply: $APP_COUNT"
cat .platform_apply .app_apply
# ── Pre-warm provider cache ──
- |
if [ -s .platform_apply ] || [ -s .app_apply ]; then
FIRST_STACK=$(cat .platform_apply .app_apply 2>/dev/null | head -1)
if [ -n "$FIRST_STACK" ]; then
echo "Pre-warming provider cache from stacks/$FIRST_STACK..."
cd "stacks/$FIRST_STACK" && terragrunt init --terragrunt-non-interactive -input=false 2>&1 | tail -3 && cd ../..
fi
fi
# ── Apply platform stacks (serial, with Vault advisory locks) ──
- |
[ci,vault] Fix Tier-1 apply silently failing in Woodpecker ## Context For weeks, every push to infra has resulted in `build-cli` workflow failure AND `default` workflow succeed — but the `default` workflow's "success" was a lie. Inside the apply-loop we were swallowing per-stack failures with `set +e ... echo FAILED` and the step exited 0 regardless. Discovered during bd code-3o3 e2e test (qbittorrent 5.0.4 → 5.1.4): agent commit landed, CI reported `default=success`, but cluster was unchanged. Log inside the step showed: [servarr] Starting apply... ERROR: Cannot read PG credentials from Vault. Run: vault login -method=oidc [servarr] FAILED (exit 1) Two root causes, two fixes here. ### 1. Vault `ci` role lacks Tier-1 PG backend creds The Tier-1 PG state backend (2026-04-16 migration, memory 407) uses the `pg-terraform-state` static DB role. `scripts/tg` reads it via `vault read database/static-creds/pg-terraform-state`. That path is permitted by the separate `terraform-state` Vault policy, which is bound only to a role in namespace `claude-agent`. The CI runner is in namespace `woodpecker` using role `ci`, whose policy grants only KV + K8s-creds + transit. Net: every Tier-1 stack apply from CI has been dying at the PG-creds fetch since the migration. **Fix**: attach `vault_policy.terraform_state` to `vault_kubernetes_auth_backend_role.ci`'s `token_policies`. No new policy needed — reuses the minimal one from 2026-04-16. ### 2. Apply-loop swallows stack failures `.woodpecker/default.yml`'s platform + app apply loops use `set +e; OUTPUT=$(... tg apply ...); EXIT=$?; set -e; [ $EXIT -ne 0 ] && echo FAILED` and then continue the while-loop. The step never re-raises, so it exits 0 regardless of how many stacks failed. **Fix**: accumulate failed stack names (excluding lock-skipped ones) into `FAILED_PLATFORM_STACKS` / `FAILED_APP_STACKS`, serialise the platform list to `.platform_failed` so it survives the step boundary, and at the end of the app-stack step exit 1 if either list is non-empty. Lock-skipped stacks remain non-fatal. Together, (1) unblocks real apply and (2) ensures the Woodpecker pipeline + the service-upgrade agent can both trust `default` workflow state again. ## What is NOT in this change - Re-running the qbittorrent upgrade to converge the cluster — the TF file is already at 5.1.4 in git; once CI picks up this commit it'll apply on its own, or Viktor can run `tg apply` locally now that the ci role has access too. - Retiring the `set +e ... continue` pattern entirely — keeping the per-stack continuation so a single bad stack doesn't hide the others' plans from the log. Just making the final status honest. ## Test Plan ### Automated `terraform plan` / apply clean (Tier-0 via scripts/tg): ``` Plan: 0 to add, 2 to change, 0 to destroy. # vault_kubernetes_auth_backend_role.ci will be updated in-place ~ token_policies = [ + "terraform-state", # (1 unchanged element hidden) ] # vault_jwt_auth_backend.oidc will be updated in-place ~ tune = [...] # cosmetic provider-schema drift, pre-existing Apply complete! Resources: 0 added, 2 changed, 0 destroyed. ``` State re-encrypted via `scripts/state-sync encrypt vault`; enc file committed. ### Manual Verification ``` # Before (on previous commit — expect failure): $ kubectl -n woodpecker exec woodpecker-server-0 -- sh -c ' SA=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token); TOK=$(curl -s -X POST http://vault-active.vault.svc:8200/v1/auth/kubernetes/login \ -d "{\"role\":\"ci\",\"jwt\":\"$SA\"}" | jq -r .auth.client_token); curl -s -H "X-Vault-Token: $TOK" \ http://vault-active.vault.svc:8200/v1/database/static-creds/pg-terraform-state' → {"errors":["1 error occurred:\n\t* permission denied\n\n"]} # After (this commit): → {"data":{"username":"terraform_state","password":"..."},...} ``` Pipeline-level: the next infra push will exercise `.woodpecker/default.yml`; expected first push is this very commit. Watch `ci.viktorbarzin.me` — the `default` workflow should either succeed for real (and land actual changes) or exit 1 with "=== FAILED STACKS ===" so the cause is visible. Refs: bd code-e1x Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 14:25:52 +00:00
FAILED_PLATFORM_STACKS=""
if [ -s .platform_apply ]; then
echo "=== Applying platform stacks (serial, locked) ==="
while read -r stack; do
echo "[$stack] Starting apply..."
set +e
OUTPUT=$(cd "stacks/$stack" && ../../scripts/tg apply --non-interactive 2>&1)
EXIT=$?
set -e
if [ $EXIT -ne 0 ]; then
if echo "$OUTPUT" | grep -q "is locked by"; then
echo "[$stack] SKIPPED (locked by another session)"
else
echo "$OUTPUT" | tail -50
echo "[$stack] FAILED (exit $EXIT)"
[ci,vault] Fix Tier-1 apply silently failing in Woodpecker ## Context For weeks, every push to infra has resulted in `build-cli` workflow failure AND `default` workflow succeed — but the `default` workflow's "success" was a lie. Inside the apply-loop we were swallowing per-stack failures with `set +e ... echo FAILED` and the step exited 0 regardless. Discovered during bd code-3o3 e2e test (qbittorrent 5.0.4 → 5.1.4): agent commit landed, CI reported `default=success`, but cluster was unchanged. Log inside the step showed: [servarr] Starting apply... ERROR: Cannot read PG credentials from Vault. Run: vault login -method=oidc [servarr] FAILED (exit 1) Two root causes, two fixes here. ### 1. Vault `ci` role lacks Tier-1 PG backend creds The Tier-1 PG state backend (2026-04-16 migration, memory 407) uses the `pg-terraform-state` static DB role. `scripts/tg` reads it via `vault read database/static-creds/pg-terraform-state`. That path is permitted by the separate `terraform-state` Vault policy, which is bound only to a role in namespace `claude-agent`. The CI runner is in namespace `woodpecker` using role `ci`, whose policy grants only KV + K8s-creds + transit. Net: every Tier-1 stack apply from CI has been dying at the PG-creds fetch since the migration. **Fix**: attach `vault_policy.terraform_state` to `vault_kubernetes_auth_backend_role.ci`'s `token_policies`. No new policy needed — reuses the minimal one from 2026-04-16. ### 2. Apply-loop swallows stack failures `.woodpecker/default.yml`'s platform + app apply loops use `set +e; OUTPUT=$(... tg apply ...); EXIT=$?; set -e; [ $EXIT -ne 0 ] && echo FAILED` and then continue the while-loop. The step never re-raises, so it exits 0 regardless of how many stacks failed. **Fix**: accumulate failed stack names (excluding lock-skipped ones) into `FAILED_PLATFORM_STACKS` / `FAILED_APP_STACKS`, serialise the platform list to `.platform_failed` so it survives the step boundary, and at the end of the app-stack step exit 1 if either list is non-empty. Lock-skipped stacks remain non-fatal. Together, (1) unblocks real apply and (2) ensures the Woodpecker pipeline + the service-upgrade agent can both trust `default` workflow state again. ## What is NOT in this change - Re-running the qbittorrent upgrade to converge the cluster — the TF file is already at 5.1.4 in git; once CI picks up this commit it'll apply on its own, or Viktor can run `tg apply` locally now that the ci role has access too. - Retiring the `set +e ... continue` pattern entirely — keeping the per-stack continuation so a single bad stack doesn't hide the others' plans from the log. Just making the final status honest. ## Test Plan ### Automated `terraform plan` / apply clean (Tier-0 via scripts/tg): ``` Plan: 0 to add, 2 to change, 0 to destroy. # vault_kubernetes_auth_backend_role.ci will be updated in-place ~ token_policies = [ + "terraform-state", # (1 unchanged element hidden) ] # vault_jwt_auth_backend.oidc will be updated in-place ~ tune = [...] # cosmetic provider-schema drift, pre-existing Apply complete! Resources: 0 added, 2 changed, 0 destroyed. ``` State re-encrypted via `scripts/state-sync encrypt vault`; enc file committed. ### Manual Verification ``` # Before (on previous commit — expect failure): $ kubectl -n woodpecker exec woodpecker-server-0 -- sh -c ' SA=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token); TOK=$(curl -s -X POST http://vault-active.vault.svc:8200/v1/auth/kubernetes/login \ -d "{\"role\":\"ci\",\"jwt\":\"$SA\"}" | jq -r .auth.client_token); curl -s -H "X-Vault-Token: $TOK" \ http://vault-active.vault.svc:8200/v1/database/static-creds/pg-terraform-state' → {"errors":["1 error occurred:\n\t* permission denied\n\n"]} # After (this commit): → {"data":{"username":"terraform_state","password":"..."},...} ``` Pipeline-level: the next infra push will exercise `.woodpecker/default.yml`; expected first push is this very commit. Watch `ci.viktorbarzin.me` — the `default` workflow should either succeed for real (and land actual changes) or exit 1 with "=== FAILED STACKS ===" so the cause is visible. Refs: bd code-e1x Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 14:25:52 +00:00
FAILED_PLATFORM_STACKS="$FAILED_PLATFORM_STACKS $stack"
fi
else
echo "$OUTPUT" | tail -3
echo "[$stack] OK"
fi
done < .platform_apply
fi
[ci,vault] Fix Tier-1 apply silently failing in Woodpecker ## Context For weeks, every push to infra has resulted in `build-cli` workflow failure AND `default` workflow succeed — but the `default` workflow's "success" was a lie. Inside the apply-loop we were swallowing per-stack failures with `set +e ... echo FAILED` and the step exited 0 regardless. Discovered during bd code-3o3 e2e test (qbittorrent 5.0.4 → 5.1.4): agent commit landed, CI reported `default=success`, but cluster was unchanged. Log inside the step showed: [servarr] Starting apply... ERROR: Cannot read PG credentials from Vault. Run: vault login -method=oidc [servarr] FAILED (exit 1) Two root causes, two fixes here. ### 1. Vault `ci` role lacks Tier-1 PG backend creds The Tier-1 PG state backend (2026-04-16 migration, memory 407) uses the `pg-terraform-state` static DB role. `scripts/tg` reads it via `vault read database/static-creds/pg-terraform-state`. That path is permitted by the separate `terraform-state` Vault policy, which is bound only to a role in namespace `claude-agent`. The CI runner is in namespace `woodpecker` using role `ci`, whose policy grants only KV + K8s-creds + transit. Net: every Tier-1 stack apply from CI has been dying at the PG-creds fetch since the migration. **Fix**: attach `vault_policy.terraform_state` to `vault_kubernetes_auth_backend_role.ci`'s `token_policies`. No new policy needed — reuses the minimal one from 2026-04-16. ### 2. Apply-loop swallows stack failures `.woodpecker/default.yml`'s platform + app apply loops use `set +e; OUTPUT=$(... tg apply ...); EXIT=$?; set -e; [ $EXIT -ne 0 ] && echo FAILED` and then continue the while-loop. The step never re-raises, so it exits 0 regardless of how many stacks failed. **Fix**: accumulate failed stack names (excluding lock-skipped ones) into `FAILED_PLATFORM_STACKS` / `FAILED_APP_STACKS`, serialise the platform list to `.platform_failed` so it survives the step boundary, and at the end of the app-stack step exit 1 if either list is non-empty. Lock-skipped stacks remain non-fatal. Together, (1) unblocks real apply and (2) ensures the Woodpecker pipeline + the service-upgrade agent can both trust `default` workflow state again. ## What is NOT in this change - Re-running the qbittorrent upgrade to converge the cluster — the TF file is already at 5.1.4 in git; once CI picks up this commit it'll apply on its own, or Viktor can run `tg apply` locally now that the ci role has access too. - Retiring the `set +e ... continue` pattern entirely — keeping the per-stack continuation so a single bad stack doesn't hide the others' plans from the log. Just making the final status honest. ## Test Plan ### Automated `terraform plan` / apply clean (Tier-0 via scripts/tg): ``` Plan: 0 to add, 2 to change, 0 to destroy. # vault_kubernetes_auth_backend_role.ci will be updated in-place ~ token_policies = [ + "terraform-state", # (1 unchanged element hidden) ] # vault_jwt_auth_backend.oidc will be updated in-place ~ tune = [...] # cosmetic provider-schema drift, pre-existing Apply complete! Resources: 0 added, 2 changed, 0 destroyed. ``` State re-encrypted via `scripts/state-sync encrypt vault`; enc file committed. ### Manual Verification ``` # Before (on previous commit — expect failure): $ kubectl -n woodpecker exec woodpecker-server-0 -- sh -c ' SA=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token); TOK=$(curl -s -X POST http://vault-active.vault.svc:8200/v1/auth/kubernetes/login \ -d "{\"role\":\"ci\",\"jwt\":\"$SA\"}" | jq -r .auth.client_token); curl -s -H "X-Vault-Token: $TOK" \ http://vault-active.vault.svc:8200/v1/database/static-creds/pg-terraform-state' → {"errors":["1 error occurred:\n\t* permission denied\n\n"]} # After (this commit): → {"data":{"username":"terraform_state","password":"..."},...} ``` Pipeline-level: the next infra push will exercise `.woodpecker/default.yml`; expected first push is this very commit. Watch `ci.viktorbarzin.me` — the `default` workflow should either succeed for real (and land actual changes) or exit 1 with "=== FAILED STACKS ===" so the cause is visible. Refs: bd code-e1x Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 14:25:52 +00:00
# Deferred until after app stacks so both lists get a chance to run.
echo "$FAILED_PLATFORM_STACKS" > .platform_failed
# ── Apply app stacks (serial, with Vault advisory locks) ──
- |
[ci,vault] Fix Tier-1 apply silently failing in Woodpecker ## Context For weeks, every push to infra has resulted in `build-cli` workflow failure AND `default` workflow succeed — but the `default` workflow's "success" was a lie. Inside the apply-loop we were swallowing per-stack failures with `set +e ... echo FAILED` and the step exited 0 regardless. Discovered during bd code-3o3 e2e test (qbittorrent 5.0.4 → 5.1.4): agent commit landed, CI reported `default=success`, but cluster was unchanged. Log inside the step showed: [servarr] Starting apply... ERROR: Cannot read PG credentials from Vault. Run: vault login -method=oidc [servarr] FAILED (exit 1) Two root causes, two fixes here. ### 1. Vault `ci` role lacks Tier-1 PG backend creds The Tier-1 PG state backend (2026-04-16 migration, memory 407) uses the `pg-terraform-state` static DB role. `scripts/tg` reads it via `vault read database/static-creds/pg-terraform-state`. That path is permitted by the separate `terraform-state` Vault policy, which is bound only to a role in namespace `claude-agent`. The CI runner is in namespace `woodpecker` using role `ci`, whose policy grants only KV + K8s-creds + transit. Net: every Tier-1 stack apply from CI has been dying at the PG-creds fetch since the migration. **Fix**: attach `vault_policy.terraform_state` to `vault_kubernetes_auth_backend_role.ci`'s `token_policies`. No new policy needed — reuses the minimal one from 2026-04-16. ### 2. Apply-loop swallows stack failures `.woodpecker/default.yml`'s platform + app apply loops use `set +e; OUTPUT=$(... tg apply ...); EXIT=$?; set -e; [ $EXIT -ne 0 ] && echo FAILED` and then continue the while-loop. The step never re-raises, so it exits 0 regardless of how many stacks failed. **Fix**: accumulate failed stack names (excluding lock-skipped ones) into `FAILED_PLATFORM_STACKS` / `FAILED_APP_STACKS`, serialise the platform list to `.platform_failed` so it survives the step boundary, and at the end of the app-stack step exit 1 if either list is non-empty. Lock-skipped stacks remain non-fatal. Together, (1) unblocks real apply and (2) ensures the Woodpecker pipeline + the service-upgrade agent can both trust `default` workflow state again. ## What is NOT in this change - Re-running the qbittorrent upgrade to converge the cluster — the TF file is already at 5.1.4 in git; once CI picks up this commit it'll apply on its own, or Viktor can run `tg apply` locally now that the ci role has access too. - Retiring the `set +e ... continue` pattern entirely — keeping the per-stack continuation so a single bad stack doesn't hide the others' plans from the log. Just making the final status honest. ## Test Plan ### Automated `terraform plan` / apply clean (Tier-0 via scripts/tg): ``` Plan: 0 to add, 2 to change, 0 to destroy. # vault_kubernetes_auth_backend_role.ci will be updated in-place ~ token_policies = [ + "terraform-state", # (1 unchanged element hidden) ] # vault_jwt_auth_backend.oidc will be updated in-place ~ tune = [...] # cosmetic provider-schema drift, pre-existing Apply complete! Resources: 0 added, 2 changed, 0 destroyed. ``` State re-encrypted via `scripts/state-sync encrypt vault`; enc file committed. ### Manual Verification ``` # Before (on previous commit — expect failure): $ kubectl -n woodpecker exec woodpecker-server-0 -- sh -c ' SA=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token); TOK=$(curl -s -X POST http://vault-active.vault.svc:8200/v1/auth/kubernetes/login \ -d "{\"role\":\"ci\",\"jwt\":\"$SA\"}" | jq -r .auth.client_token); curl -s -H "X-Vault-Token: $TOK" \ http://vault-active.vault.svc:8200/v1/database/static-creds/pg-terraform-state' → {"errors":["1 error occurred:\n\t* permission denied\n\n"]} # After (this commit): → {"data":{"username":"terraform_state","password":"..."},...} ``` Pipeline-level: the next infra push will exercise `.woodpecker/default.yml`; expected first push is this very commit. Watch `ci.viktorbarzin.me` — the `default` workflow should either succeed for real (and land actual changes) or exit 1 with "=== FAILED STACKS ===" so the cause is visible. Refs: bd code-e1x Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 14:25:52 +00:00
FAILED_APP_STACKS=""
if [ -s .app_apply ]; then
echo "=== Applying app stacks (serial, locked) ==="
while read -r stack; do
echo "[$stack] Starting apply..."
set +e
OUTPUT=$(cd "stacks/$stack" && ../../scripts/tg apply --non-interactive 2>&1)
EXIT=$?
set -e
if [ $EXIT -ne 0 ]; then
if echo "$OUTPUT" | grep -q "is locked by"; then
echo "[$stack] SKIPPED (locked by another session)"
else
echo "$OUTPUT" | tail -50
echo "[$stack] FAILED (exit $EXIT)"
[ci,vault] Fix Tier-1 apply silently failing in Woodpecker ## Context For weeks, every push to infra has resulted in `build-cli` workflow failure AND `default` workflow succeed — but the `default` workflow's "success" was a lie. Inside the apply-loop we were swallowing per-stack failures with `set +e ... echo FAILED` and the step exited 0 regardless. Discovered during bd code-3o3 e2e test (qbittorrent 5.0.4 → 5.1.4): agent commit landed, CI reported `default=success`, but cluster was unchanged. Log inside the step showed: [servarr] Starting apply... ERROR: Cannot read PG credentials from Vault. Run: vault login -method=oidc [servarr] FAILED (exit 1) Two root causes, two fixes here. ### 1. Vault `ci` role lacks Tier-1 PG backend creds The Tier-1 PG state backend (2026-04-16 migration, memory 407) uses the `pg-terraform-state` static DB role. `scripts/tg` reads it via `vault read database/static-creds/pg-terraform-state`. That path is permitted by the separate `terraform-state` Vault policy, which is bound only to a role in namespace `claude-agent`. The CI runner is in namespace `woodpecker` using role `ci`, whose policy grants only KV + K8s-creds + transit. Net: every Tier-1 stack apply from CI has been dying at the PG-creds fetch since the migration. **Fix**: attach `vault_policy.terraform_state` to `vault_kubernetes_auth_backend_role.ci`'s `token_policies`. No new policy needed — reuses the minimal one from 2026-04-16. ### 2. Apply-loop swallows stack failures `.woodpecker/default.yml`'s platform + app apply loops use `set +e; OUTPUT=$(... tg apply ...); EXIT=$?; set -e; [ $EXIT -ne 0 ] && echo FAILED` and then continue the while-loop. The step never re-raises, so it exits 0 regardless of how many stacks failed. **Fix**: accumulate failed stack names (excluding lock-skipped ones) into `FAILED_PLATFORM_STACKS` / `FAILED_APP_STACKS`, serialise the platform list to `.platform_failed` so it survives the step boundary, and at the end of the app-stack step exit 1 if either list is non-empty. Lock-skipped stacks remain non-fatal. Together, (1) unblocks real apply and (2) ensures the Woodpecker pipeline + the service-upgrade agent can both trust `default` workflow state again. ## What is NOT in this change - Re-running the qbittorrent upgrade to converge the cluster — the TF file is already at 5.1.4 in git; once CI picks up this commit it'll apply on its own, or Viktor can run `tg apply` locally now that the ci role has access too. - Retiring the `set +e ... continue` pattern entirely — keeping the per-stack continuation so a single bad stack doesn't hide the others' plans from the log. Just making the final status honest. ## Test Plan ### Automated `terraform plan` / apply clean (Tier-0 via scripts/tg): ``` Plan: 0 to add, 2 to change, 0 to destroy. # vault_kubernetes_auth_backend_role.ci will be updated in-place ~ token_policies = [ + "terraform-state", # (1 unchanged element hidden) ] # vault_jwt_auth_backend.oidc will be updated in-place ~ tune = [...] # cosmetic provider-schema drift, pre-existing Apply complete! Resources: 0 added, 2 changed, 0 destroyed. ``` State re-encrypted via `scripts/state-sync encrypt vault`; enc file committed. ### Manual Verification ``` # Before (on previous commit — expect failure): $ kubectl -n woodpecker exec woodpecker-server-0 -- sh -c ' SA=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token); TOK=$(curl -s -X POST http://vault-active.vault.svc:8200/v1/auth/kubernetes/login \ -d "{\"role\":\"ci\",\"jwt\":\"$SA\"}" | jq -r .auth.client_token); curl -s -H "X-Vault-Token: $TOK" \ http://vault-active.vault.svc:8200/v1/database/static-creds/pg-terraform-state' → {"errors":["1 error occurred:\n\t* permission denied\n\n"]} # After (this commit): → {"data":{"username":"terraform_state","password":"..."},...} ``` Pipeline-level: the next infra push will exercise `.woodpecker/default.yml`; expected first push is this very commit. Watch `ci.viktorbarzin.me` — the `default` workflow should either succeed for real (and land actual changes) or exit 1 with "=== FAILED STACKS ===" so the cause is visible. Refs: bd code-e1x Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 14:25:52 +00:00
FAILED_APP_STACKS="$FAILED_APP_STACKS $stack"
fi
else
echo "$OUTPUT" | tail -3
echo "[$stack] OK"
fi
done < .app_apply
fi
[ci,vault] Fix Tier-1 apply silently failing in Woodpecker ## Context For weeks, every push to infra has resulted in `build-cli` workflow failure AND `default` workflow succeed — but the `default` workflow's "success" was a lie. Inside the apply-loop we were swallowing per-stack failures with `set +e ... echo FAILED` and the step exited 0 regardless. Discovered during bd code-3o3 e2e test (qbittorrent 5.0.4 → 5.1.4): agent commit landed, CI reported `default=success`, but cluster was unchanged. Log inside the step showed: [servarr] Starting apply... ERROR: Cannot read PG credentials from Vault. Run: vault login -method=oidc [servarr] FAILED (exit 1) Two root causes, two fixes here. ### 1. Vault `ci` role lacks Tier-1 PG backend creds The Tier-1 PG state backend (2026-04-16 migration, memory 407) uses the `pg-terraform-state` static DB role. `scripts/tg` reads it via `vault read database/static-creds/pg-terraform-state`. That path is permitted by the separate `terraform-state` Vault policy, which is bound only to a role in namespace `claude-agent`. The CI runner is in namespace `woodpecker` using role `ci`, whose policy grants only KV + K8s-creds + transit. Net: every Tier-1 stack apply from CI has been dying at the PG-creds fetch since the migration. **Fix**: attach `vault_policy.terraform_state` to `vault_kubernetes_auth_backend_role.ci`'s `token_policies`. No new policy needed — reuses the minimal one from 2026-04-16. ### 2. Apply-loop swallows stack failures `.woodpecker/default.yml`'s platform + app apply loops use `set +e; OUTPUT=$(... tg apply ...); EXIT=$?; set -e; [ $EXIT -ne 0 ] && echo FAILED` and then continue the while-loop. The step never re-raises, so it exits 0 regardless of how many stacks failed. **Fix**: accumulate failed stack names (excluding lock-skipped ones) into `FAILED_PLATFORM_STACKS` / `FAILED_APP_STACKS`, serialise the platform list to `.platform_failed` so it survives the step boundary, and at the end of the app-stack step exit 1 if either list is non-empty. Lock-skipped stacks remain non-fatal. Together, (1) unblocks real apply and (2) ensures the Woodpecker pipeline + the service-upgrade agent can both trust `default` workflow state again. ## What is NOT in this change - Re-running the qbittorrent upgrade to converge the cluster — the TF file is already at 5.1.4 in git; once CI picks up this commit it'll apply on its own, or Viktor can run `tg apply` locally now that the ci role has access too. - Retiring the `set +e ... continue` pattern entirely — keeping the per-stack continuation so a single bad stack doesn't hide the others' plans from the log. Just making the final status honest. ## Test Plan ### Automated `terraform plan` / apply clean (Tier-0 via scripts/tg): ``` Plan: 0 to add, 2 to change, 0 to destroy. # vault_kubernetes_auth_backend_role.ci will be updated in-place ~ token_policies = [ + "terraform-state", # (1 unchanged element hidden) ] # vault_jwt_auth_backend.oidc will be updated in-place ~ tune = [...] # cosmetic provider-schema drift, pre-existing Apply complete! Resources: 0 added, 2 changed, 0 destroyed. ``` State re-encrypted via `scripts/state-sync encrypt vault`; enc file committed. ### Manual Verification ``` # Before (on previous commit — expect failure): $ kubectl -n woodpecker exec woodpecker-server-0 -- sh -c ' SA=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token); TOK=$(curl -s -X POST http://vault-active.vault.svc:8200/v1/auth/kubernetes/login \ -d "{\"role\":\"ci\",\"jwt\":\"$SA\"}" | jq -r .auth.client_token); curl -s -H "X-Vault-Token: $TOK" \ http://vault-active.vault.svc:8200/v1/database/static-creds/pg-terraform-state' → {"errors":["1 error occurred:\n\t* permission denied\n\n"]} # After (this commit): → {"data":{"username":"terraform_state","password":"..."},...} ``` Pipeline-level: the next infra push will exercise `.woodpecker/default.yml`; expected first push is this very commit. Watch `ci.viktorbarzin.me` — the `default` workflow should either succeed for real (and land actual changes) or exit 1 with "=== FAILED STACKS ===" so the cause is visible. Refs: bd code-e1x Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 14:25:52 +00:00
# Fail the step loudly so the pipeline `default` workflow state
# reflects reality — the service-upgrade agent and CI alert cascade
# both rely on this (see bd code-e1x). Lock-skipped stacks are NOT
# counted as failures.
FAILED_PLATFORM=$(cat .platform_failed 2>/dev/null | tr -d ' ')
if [ -n "$FAILED_PLATFORM" ] || [ -n "$FAILED_APP_STACKS" ]; then
echo "=== FAILED STACKS: platform=[$FAILED_PLATFORM ] apps=[$FAILED_APP_STACKS ] ==="
exit 1
fi
# ── Commit and push state changes ──
- |
mkdir -p ~/.ssh && ssh-keyscan -H github.com >> ~/.ssh/known_hosts 2>/dev/null
chmod 400 secrets/deploy_key
git add stacks/ state/ .woodpecker/ 2>/dev/null || true
git remote set-url origin git@github.com:ViktorBarzin/infra.git
git diff --cached --quiet && echo "No changes to commit" && exit 0
git commit -m "Woodpecker CI deploy [CI SKIP]"
GIT_SSH_COMMAND='ssh -i ./secrets/deploy_key -o IdentitiesOnly=yes' git fetch origin master
if ! GIT_SSH_COMMAND='ssh -i ./secrets/deploy_key -o IdentitiesOnly=yes' git rebase origin/master; then
echo "ERROR: Git rebase failed — state commits could not be pushed"
echo "Manual intervention required: pull, resolve conflicts, push"
GIT_SSH_COMMAND='ssh -i ./secrets/deploy_key -o IdentitiesOnly=yes' git rebase --abort || true
exit 1
fi
GIT_SSH_COMMAND='ssh -i ./secrets/deploy_key -o IdentitiesOnly=yes' git push origin master
# ── Slack notification ──
- |
PLATFORM_COUNT=$(wc -l < .platform_apply 2>/dev/null | tr -d ' ')
APP_COUNT=$(wc -l < .app_apply 2>/dev/null | tr -d ' ')
curl -s -X POST -H 'Content-type: application/json' \
--data "{\"channel\":\"general\",\"text\":\"Woodpecker CI: infra pipeline ${CI_PIPELINE_STATUS} (platform:${PLATFORM_COUNT}, apps:${APP_COUNT})\"}" \
"$SLACK_WEBHOOK" || true
# Slack on failure (runs even if apply step fails)
- name: notify-failure
image: curlimages/curl
commands:
- |
curl -s -X POST -H 'Content-type: application/json' \
--data "{\"channel\":\"general\",\"text\":\":red_circle: Woodpecker CI: infra pipeline FAILED\"}" \
"$SLACK_WEBHOOK" || true
environment:
SLACK_WEBHOOK:
from_secret: slack_webhook
when:
status: [failure]