[ci,vault] Fix Tier-1 apply silently failing in Woodpecker
## Context
For weeks, every push to infra has ended with the `build-cli` workflow
failing and the `default` workflow succeeding, but the `default` workflow's
"success" was a lie. Inside the apply loop we were swallowing per-stack
failures with `set +e ... echo FAILED`, so the step exited 0 regardless.
Discovered during the bd code-3o3 e2e test (qbittorrent 5.0.4 → 5.1.4):
the agent commit landed and CI reported `default=success`, but the cluster
was unchanged. The log inside the step showed:
    [servarr] Starting apply...
    ERROR: Cannot read PG credentials from Vault.
    Run: vault login -method=oidc
    [servarr] FAILED (exit 1)
Two root causes, two fixes here.
### 1. Vault `ci` role lacks Tier-1 PG backend creds
The Tier-1 PG state backend (2026-04-16 migration, memory 407) uses
the `pg-terraform-state` static DB role. `scripts/tg` reads it via
`vault read database/static-creds/pg-terraform-state`. That path is
permitted by the separate `terraform-state` Vault policy, which is
bound only to a role in namespace `claude-agent`. The CI runner is in
namespace `woodpecker` using role `ci`, whose policy grants only KV
+ K8s-creds + transit. Net: every Tier-1 stack apply from CI has
been dying at the PG-creds fetch since the migration.
**Fix**: attach `vault_policy.terraform_state` to
`vault_kubernetes_auth_backend_role.ci`'s `token_policies`. No new
policy needed — reuses the minimal one from 2026-04-16.
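A minimal sketch of how the attachment could be spot-checked from a shell. `role_has_policy` is a hypothetical helper, not part of this change; the commented usage assumes the `vault` CLI and `jq` are available and that the caller's token may read the role at `auth/kubernetes/role/ci`.

```shell
# role_has_policy POLICY [ATTACHED...]
# Succeeds if POLICY appears among the remaining arguments.
# Hypothetical helper for illustration only.
role_has_policy() {
  want=$1
  shift
  for attached in "$@"; do
    if [ "$attached" = "$want" ]; then
      return 0
    fi
  done
  return 1
}

# Assumed usage against a live Vault (not run here):
#   policies=$(vault read -format=json auth/kubernetes/role/ci \
#     | jq -r '.data.token_policies[]')
#   role_has_policy terraform-state $policies \
#     || echo "ci role is missing terraform-state"
```

Before this commit the check would be expected to fail for `terraform-state`; after the apply it should pass.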
### 2. Apply-loop swallows stack failures
`.woodpecker/default.yml`'s platform + app apply loops use
`set +e; OUTPUT=$(... tg apply ...); EXIT=$?; set -e; [ $EXIT -ne 0 ]
&& echo FAILED` and then continue the while-loop. The step never
re-raises, so it exits 0 regardless of how many stacks failed.
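The swallowing behaviour can be reproduced in isolation. This is a reduced sketch, not the real step: `fake_apply` is a hypothetical stand-in for `tg apply` that fails only for a stack named `bad`, and the stack names are made up.

```shell
# Hypothetical stand-in for `tg apply`: fails only for the stack "bad".
fake_apply() {
  [ "$1" != "bad" ]
}

# Same shape as the loop described above: the failure is logged but
# never re-raised, so the loop as a whole still reports success.
# The body runs in a subshell so `set -e` does not leak to the caller.
swallowing_loop() (
  set -e
  for stack in good bad other; do
    set +e
    OUTPUT=$(fake_apply "$stack" 2>&1)
    EXIT=$?
    set -e
    if [ "$EXIT" -ne 0 ]; then
      echo "[$stack] FAILED (exit $EXIT)"
    fi
  done
)

swallowing_loop
echo "step exit status: $?"
```

The loop prints the `FAILED` line for `bad` and then reports status 0, which is exactly what CI saw.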
**Fix**: accumulate failed stack names (excluding lock-skipped ones)
into `FAILED_PLATFORM_STACKS` / `FAILED_APP_STACKS`, serialise the
platform list to `.platform_failed` so it survives the step boundary,
and at the end of the app-stack step exit 1 if either list is
non-empty. Lock-skipped stacks remain non-fatal.
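The accumulate-then-fail pattern can be sketched as follows. This is an illustration under assumptions, not the committed code: `fake_apply`, `apply_all`, and `run_pipeline` are hypothetical names, lock-skipping is not modelled, and the temp directory stands in for the workspace shared between steps. Unlike the committed check, the sketch strips spaces only for the emptiness test so the names stay separated in the message.

```shell
# Hypothetical stand-in for `tg apply`: fails for stacks named bad*.
fake_apply() {
  case "$1" in
    bad*) return 1 ;;
    *) return 0 ;;
  esac
}

# Apply every stack given as an argument; log failures to stderr and
# print the space-separated list of failed names on stdout.
apply_all() {
  failed=""
  for stack in "$@"; do
    if ! fake_apply "$stack"; then
      echo "[$stack] FAILED" >&2
      failed="$failed $stack"
    fi
  done
  echo "$failed"
}

# run_pipeline "PLATFORM STACKS" "APP STACKS"
run_pipeline() (
  workdir=$(mktemp -d)
  # Platform phase: persist the list in a file, because a shell
  # variable would not survive into a separate step.
  apply_all $1 > "$workdir/.platform_failed"
  # App phase, then the deferred verdict over both lists.
  failed_apps=$(apply_all $2)
  failed_platform=$(cat "$workdir/.platform_failed")
  rm -rf "$workdir"
  if [ -n "$(printf '%s' "$failed_platform$failed_apps" | tr -d ' ')" ]; then
    echo "=== FAILED STACKS: platform=[$failed_platform ] apps=[$failed_apps ] ==="
    exit 1
  fi
  echo "all stacks applied"
)
```

Calling `run_pipeline "vault bad-db" "servarr"` still attempts every stack but returns non-zero at the end, which is the behaviour the pipeline and the service-upgrade agent need.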
Together, (1) unblocks real apply and (2) ensures the Woodpecker
pipeline + the service-upgrade agent can both trust `default`
workflow state again.
## What is NOT in this change
- Re-running the qbittorrent upgrade to converge the cluster: the
  TF file is already at 5.1.4 in git; once CI picks up this commit
  it'll apply on its own, or Viktor can run `tg apply` locally now
  that the ci role has access too.
- Retiring the `set +e ... continue` pattern entirely: keeping the
  per-stack continuation so a single bad stack doesn't hide the
  others' plans from the log. Just making the final status honest.
## Test Plan
### Automated
`terraform plan` / apply clean (Tier-0 via scripts/tg):
```
Plan: 0 to add, 2 to change, 0 to destroy.

  # vault_kubernetes_auth_backend_role.ci will be updated in-place
  ~ token_policies = [
      + "terraform-state",
        # (1 unchanged element hidden)
    ]

  # vault_jwt_auth_backend.oidc will be updated in-place
  ~ tune = [...] # cosmetic provider-schema drift, pre-existing

Apply complete! Resources: 0 added, 2 changed, 0 destroyed.
```
State re-encrypted via `scripts/state-sync encrypt vault`; enc file
committed.
### Manual Verification
```
# Before (on previous commit — expect failure):
$ kubectl -n woodpecker exec woodpecker-server-0 -- sh -c '
    SA=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token);
    TOK=$(curl -s -X POST http://vault-active.vault.svc:8200/v1/auth/kubernetes/login \
      -d "{\"role\":\"ci\",\"jwt\":\"$SA\"}" | jq -r .auth.client_token);
    curl -s -H "X-Vault-Token: $TOK" \
      http://vault-active.vault.svc:8200/v1/database/static-creds/pg-terraform-state'
→ {"errors":["1 error occurred:\n\t* permission denied\n\n"]}
# After (this commit):
→ {"data":{"username":"terraform_state","password":"..."},...}
```
Pipeline-level: the next infra push will exercise
`.woodpecker/default.yml`; expected first push is this very commit.
Watch `ci.viktorbarzin.me` — the `default` workflow should either
succeed for real (and land actual changes) or exit 1 with
"=== FAILED STACKS ===" so the cause is visible.
Refs: bd code-e1x
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
parent 2431c6d5fe
commit 2eca011cc3
4 changed files with 1918 additions and 1892 deletions

@@ -123,6 +123,7 @@ steps:

       # ── Apply platform stacks (serial, with Vault advisory locks) ──
       - |
+        FAILED_PLATFORM_STACKS=""
         if [ -s .platform_apply ]; then
           echo "=== Applying platform stacks (serial, locked) ==="
           while read -r stack; do
@@ -137,6 +138,7 @@ steps:
               else
                 echo "$OUTPUT" | tail -5
                 echo "[$stack] FAILED (exit $EXIT)"
+                FAILED_PLATFORM_STACKS="$FAILED_PLATFORM_STACKS $stack"
               fi
             else
               echo "$OUTPUT" | tail -3
@@ -144,9 +146,12 @@ steps:
             fi
           done < .platform_apply
         fi
+        # Deferred until after app stacks so both lists get a chance to run.
+        echo "$FAILED_PLATFORM_STACKS" > .platform_failed

       # ── Apply app stacks (serial, with Vault advisory locks) ──
       - |
+        FAILED_APP_STACKS=""
         if [ -s .app_apply ]; then
           echo "=== Applying app stacks (serial, locked) ==="
           while read -r stack; do
@@ -161,6 +166,7 @@ steps:
               else
                 echo "$OUTPUT" | tail -5
                 echo "[$stack] FAILED (exit $EXIT)"
+                FAILED_APP_STACKS="$FAILED_APP_STACKS $stack"
               fi
             else
               echo "$OUTPUT" | tail -3
@@ -168,6 +174,15 @@ steps:
             fi
           done < .app_apply
         fi
+        # Fail the step loudly so the pipeline `default` workflow state
+        # reflects reality — the service-upgrade agent and CI alert cascade
+        # both rely on this (see bd code-e1x). Lock-skipped stacks are NOT
+        # counted as failures.
+        FAILED_PLATFORM=$(cat .platform_failed 2>/dev/null | tr -d ' ')
+        if [ -n "$FAILED_PLATFORM" ] || [ -n "$FAILED_APP_STACKS" ]; then
+          echo "=== FAILED STACKS: platform=[$FAILED_PLATFORM ] apps=[$FAILED_APP_STACKS ] ==="
+          exit 1
+        fi

       # ── Commit and push state changes ──
       - |

@@ -394,9 +394,14 @@ resource "vault_kubernetes_auth_backend_role" "ci" {
   role_name                        = "ci"
   bound_service_account_names      = ["default"]
   bound_service_account_namespaces = ["woodpecker"]
-  token_policies                   = [vault_policy.ci.name]
-  token_ttl                        = 604800 # 7d
-  token_period                     = 604800 # periodic: auto-renews indefinitely
+  # terraform_state policy grants `database/static-creds/pg-terraform-state`
+  # read — scripts/tg needs this to fetch the Tier-1 PG backend password.
+  # Without it, CI's per-stack `tg apply` dies with
+  # `ERROR: Cannot read PG credentials from Vault` and the default.yml
+  # apply-loop swallows the exit code (set +e) — fixed in bd code-e1x.
+  token_policies = [vault_policy.ci.name, vault_policy.terraform_state.name]
+  token_ttl      = 604800 # 7d
+  token_period   = 604800 # periodic: auto-renews indefinitely
 }

 # --- ESO Policy & Role ---

@@ -9,6 +9,10 @@ terraform {
       source  = "cloudflare/cloudflare"
       version = "~> 4"
     }
+    authentik = {
+      source  = "goauthentik/authentik"
+      version = "~> 2024.10"
+    }
   }
 }

File diff suppressed because one or more lines are too long