[ci,vault] Fix Tier-1 apply silently failing in Woodpecker

## Context

For weeks, every push to infra has resulted in a `build-cli` workflow failure AND a `default` workflow success, but the `default` workflow's "success" was a lie. Inside the apply-loop we were swallowing per-stack failures with `set +e ... echo FAILED`, and the step exited 0 regardless.

Discovered during the bd code-3o3 e2e test (qbittorrent 5.0.4 → 5.1.4): the agent commit landed, CI reported `default=success`, but the cluster was unchanged. The log inside the step showed:

    [servarr] Starting apply...
    ERROR: Cannot read PG credentials from Vault. Run: vault login -method=oidc
    [servarr] FAILED (exit 1)

Two root causes, two fixes here.

### 1. Vault `ci` role lacks Tier-1 PG backend creds

The Tier-1 PG state backend (2026-04-16 migration, memory 407) uses the `pg-terraform-state` static DB role. `scripts/tg` reads it via `vault read database/static-creds/pg-terraform-state`. That path is permitted by the separate `terraform-state` Vault policy, which is bound only to a role in namespace `claude-agent`. The CI runner is in namespace `woodpecker` using role `ci`, whose policy grants only KV + K8s-creds + transit.

Net: every Tier-1 stack apply from CI has been dying at the PG-creds fetch since the migration.

**Fix**: attach `vault_policy.terraform_state` to `vault_kubernetes_auth_backend_role.ci`'s `token_policies`. No new policy is needed; this reuses the minimal one from 2026-04-16.

### 2. Apply-loop swallows stack failures

`.woodpecker/default.yml`'s platform + app apply loops use `set +e; OUTPUT=$(... tg apply ...); EXIT=$?; set -e; [ $EXIT -ne 0 ] && echo FAILED` and then continue the while-loop. The step never re-raises, so it exits 0 regardless of how many stacks failed.

**Fix**: accumulate failed stack names (excluding lock-skipped ones) into `FAILED_PLATFORM_STACKS` / `FAILED_APP_STACKS`, serialise the platform list to `.platform_failed` so it survives the step boundary, and at the end of the app-stack step exit 1 if either list is non-empty. Lock-skipped stacks remain non-fatal.

Together, (1) unblocks real apply and (2) ensures the Woodpecker pipeline and the service-upgrade agent can both trust `default` workflow state again.

## What is NOT in this change

- Re-running the qbittorrent upgrade to converge the cluster. The TF file is already at 5.1.4 in git; once CI picks up this commit it will apply on its own, or Viktor can run `tg apply` locally now that the ci role has access too.
- Retiring the `set +e ... continue` pattern entirely. The per-stack continuation stays so that a single bad stack doesn't hide the others' plans from the log. This change only makes the final status honest.

## Test Plan

### Automated

`terraform plan` / apply clean (Tier-0 via scripts/tg):

```
Plan: 0 to add, 2 to change, 0 to destroy.

  # vault_kubernetes_auth_backend_role.ci will be updated in-place
  ~ token_policies = [
      + "terraform-state",
        # (1 unchanged element hidden)
    ]

  # vault_jwt_auth_backend.oidc will be updated in-place
  ~ tune = [...] # cosmetic provider-schema drift, pre-existing

Apply complete! Resources: 0 added, 2 changed, 0 destroyed.
```

State re-encrypted via `scripts/state-sync encrypt vault`; enc file committed.

### Manual Verification

```
# Before (on previous commit; expect failure):
$ kubectl -n woodpecker exec woodpecker-server-0 -- sh -c '
    SA=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token);
    TOK=$(curl -s -X POST http://vault-active.vault.svc:8200/v1/auth/kubernetes/login \
      -d "{\"role\":\"ci\",\"jwt\":\"$SA\"}" | jq -r .auth.client_token);
    curl -s -H "X-Vault-Token: $TOK" \
      http://vault-active.vault.svc:8200/v1/database/static-creds/pg-terraform-state'
→ {"errors":["1 error occurred:\n\t* permission denied\n\n"]}

# After (this commit):
→ {"data":{"username":"terraform_state","password":"..."},...}
```

Pipeline-level: the next infra push will exercise `.woodpecker/default.yml`; the expected first push is this very commit. Watch `ci.viktorbarzin.me`: the `default` workflow should either succeed for real (and land actual changes) or exit 1 with "=== FAILED STACKS ===" so the cause is visible.

Refs: bd code-e1x

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 14:25:52 +00:00
commit 2eca011cc3
parent 2431c6d5fe
4 changed files with 1918 additions and 1892 deletions
--- a/.woodpecker/default.yml
+++ b/.woodpecker/default.yml
@@ -123,6 +123,7 @@ steps:

      # ── Apply platform stacks (serial, with Vault advisory locks) ──
      - |
+        FAILED_PLATFORM_STACKS=""
        if [ -s .platform_apply ]; then
          echo "=== Applying platform stacks (serial, locked) ==="
          while read -r stack; do
@@ -137,6 +138,7 @@ steps:
              else
                echo "$OUTPUT" | tail -5
                echo "[$stack] FAILED (exit $EXIT)"
+                FAILED_PLATFORM_STACKS="$FAILED_PLATFORM_STACKS $stack"
              fi
            else
              echo "$OUTPUT" | tail -3
@@ -144,9 +146,12 @@ steps:
            fi
          done < .platform_apply
        fi
+        # Deferred until after app stacks so both lists get a chance to run.
+        echo "$FAILED_PLATFORM_STACKS" > .platform_failed

      # ── Apply app stacks (serial, with Vault advisory locks) ──
      - |
+        FAILED_APP_STACKS=""
        if [ -s .app_apply ]; then
          echo "=== Applying app stacks (serial, locked) ==="
          while read -r stack; do
@@ -161,6 +166,7 @@ steps:
              else
                echo "$OUTPUT" | tail -5
                echo "[$stack] FAILED (exit $EXIT)"
+                FAILED_APP_STACKS="$FAILED_APP_STACKS $stack"
              fi
            else
              echo "$OUTPUT" | tail -3
@@ -168,6 +174,15 @@ steps:
            fi
          done < .app_apply
        fi
+        # Fail the step loudly so the pipeline `default` workflow state
+        # reflects reality — the service-upgrade agent and CI alert cascade
+        # both rely on this (see bd code-e1x). Lock-skipped stacks are NOT
+        # counted as failures.
+        FAILED_PLATFORM=$(cat .platform_failed 2>/dev/null | tr -d ' ')
+        if [ -n "$FAILED_PLATFORM" ] || [ -n "$FAILED_APP_STACKS" ]; then
+          echo "=== FAILED STACKS: platform=[$FAILED_PLATFORM ] apps=[$FAILED_APP_STACKS ] ==="
+          exit 1
+        fi

      # ── Commit and push state changes ──
      - |
--- a/stacks/vault/main.tf
+++ b/stacks/vault/main.tf
@@ -394,9 +394,14 @@ resource "vault_kubernetes_auth_backend_role" "ci" {
  role_name                        = "ci"
  bound_service_account_names      = ["default"]
  bound_service_account_namespaces = ["woodpecker"]
-  token_policies                   = [vault_policy.ci.name]
-  token_ttl                        = 604800 # 7d
-  token_period                     = 604800 # periodic: auto-renews indefinitely
+  # terraform_state policy grants `database/static-creds/pg-terraform-state`
+  # read — scripts/tg needs this to fetch the Tier-1 PG backend password.
+  # Without it, CI's per-stack `tg apply` dies with
+  # `ERROR: Cannot read PG credentials from Vault` and the default.yml
+  # apply-loop swallows the exit code (set +e) — fixed in bd code-e1x.
+  token_policies = [vault_policy.ci.name, vault_policy.terraform_state.name]
+  token_ttl      = 604800 # 7d
+  token_period   = 604800 # periodic: auto-renews indefinitely
 }

 # --- ESO Policy & Role ---
--- a/stacks/vault/providers.tf
+++ b/stacks/vault/providers.tf
@@ -9,6 +9,10 @@ terraform {
      source  = "cloudflare/cloudflare"
      version = "~> 4"
    }
+    authentik = {
+      source  = "goauthentik/authentik"
+      version = "~> 2024.10"
+    }
  }
 }

--- a/state/stacks/vault/terraform.tfstate.enc
+++ b/state/stacks/vault/terraform.tfstate.enc
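
The accumulate-then-fail pattern used in the `.woodpecker/default.yml` hunks can be sketched in isolation as below. This is a minimal illustration, not the pipeline's actual code: `run_stack`, `apply_all`, and `normalize` are hypothetical names, and `run_stack` only simulates `tg apply`. The sketch also shows one detail of the final check: `tr -d ' '` deletes every space, so when more than one stack fails the names run together in the summary line. That does not affect the exit status, only readability; the `normalize` helper is one possible alternative that trims the list while keeping the names separated.

```shell
#!/bin/sh
# Hypothetical stand-in for the real per-stack `tg apply`:
# pretend that `servarr` and `vault` fail and everything else succeeds.
run_stack() {
  case "$1" in
    servarr|vault) return 1 ;;
    *) return 0 ;;
  esac
}

# Apply every stack passed as an argument, continuing past failures.
# Progress goes to stderr; the space-separated list of failed stacks
# (with a leading space, as in the diff) goes to stdout.
apply_all() {
  failed=""
  for stack in "$@"; do
    rc=0
    run_stack "$stack" || rc=$?
    if [ "$rc" -ne 0 ]; then
      echo "[$stack] FAILED (exit $rc)" >&2
      failed="$failed $stack"
    else
      echo "[$stack] OK" >&2
    fi
  done
  echo "$failed"
}

# Trim and collapse whitespace without gluing names together:
# the unquoted expansion splits on whitespace, "$*" rejoins with single spaces.
normalize() {
  set -- $1
  echo "$*"
}

raw=$(apply_all servarr monitoring vault 2>/dev/null)
echo "tr -d ' ' : [$(echo "$raw" | tr -d ' ')]"   # prints: tr -d ' ' : [servarrvault]
echo "normalize : [$(normalize "$raw")]"          # prints: normalize : [servarr vault]
```

Either form yields an empty string when nothing failed, so a `[ -n "$FAILED_PLATFORM" ]` style check behaves the same with both; the difference only shows up in the printed list.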