ci(woodpecker): stop applying/planning the Tier-0 vault stack in CI
All checks were successful
ci/woodpecker/push/default Pipeline was successful

The nightly drift-detection cron and every vault-touching push apply have
been failing because CI runs terragrunt plan/apply on the Tier-0 `vault`
stack, which manages Vault's own transit mount + ACL policies. The CI
`ci` Vault role intentionally lacks those admin perms (sys/mounts,
sys/policies/acl), so the run always errors:
  - apply: 403 on vault_mount.transit + vault_policy.personal_emo, plus an
    Invalid for_each (local.k8s_users from secret/platform is deferred)
  - drift: terragrunt plan exits 1 → fails the whole nightly run

vault is Tier-0 = human-applied via OIDC. Skip it in both pipelines:
- default.yml: skip `vault` in the platform-apply loop (kept in
  PLATFORM_STACKS so the app-stack detector still excludes it)
- drift-detection.yml: skip `vault` in the per-stack plan loop
- ci-cd.md: document the exclusion on both pipeline rows

Found during a CI-health sweep (user reported many failures): GitHub
Actions all green; all Woodpecker repos green except this recurring
infra-repo failure, doubled by the legacy repo-1 + repo-82 dual
registration.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
Viktor Barzin 2026-06-27 15:48:20 +00:00
parent 81c2b14e29
commit bbc797b30e
3 changed files with 15 additions and 2 deletions

View file

@ -213,6 +213,12 @@ steps:
if [ -s .platform_apply ]; then
echo "=== Applying platform stacks (serial, locked) ==="
while read -r stack; do
# Tier-0 `vault` is human-applied via OIDC; the CI `ci` Vault role
# lacks Vault-admin perms (sys/mounts + sys/policies/acl), so a CI
# apply always 403s and fails the pipeline. Kept in PLATFORM_STACKS
# (so the app-stack detector still excludes it) but skipped here.
# (2026-06-27 — see docs/architecture/ci-cd.md)
if [ "$stack" = "vault" ]; then echo "[vault] SKIPPED (Tier-0, human-applied via OIDC)"; continue; fi
echo "[$stack] Starting apply..."
set +e
OUTPUT=$(cd "stacks/$stack" && ../../scripts/tg apply --non-interactive 2>&1)

View file

@ -85,6 +85,13 @@ steps:
stack=$(basename "$stack_dir")
[ -f "$stack_dir/terragrunt.hcl" ] || continue
# Tier-0 `vault` is human-applied via OIDC; the CI `ci` Vault role lacks
# Vault-admin perms (sys/mounts + sys/policies/acl), so `terragrunt plan`
# on it ERRORs (detailed-exitcode 1) and fails the whole nightly drift
# run. Skip it — drift on Tier-0 vault is caught at human apply time.
# (2026-06-27)
[ "$stack" = "vault" ] && continue
echo -n "[$stack] planning... "
OUTPUT=$(cd "$stack_dir" && terragrunt plan -detailed-exitcode -input=false 2>&1)
EXIT=$?

View file

@ -174,9 +174,9 @@ Woodpecker is **deploy + cluster-touching steps only**:
| Pipeline | File | Purpose |
|----------|------|---------|
| per-app deploy | `.woodpecker/deploy.yml` (each repo) | `kubectl set image` + Slack notify (event: **manual**) |
| terragrunt apply | `.woodpecker/default.yml` | Changed-stacks apply on push to master (runs in `infra-ci`) |
| terragrunt apply | `.woodpecker/default.yml` | Changed-stacks apply on push to master (runs in `infra-ci`). **Skips Tier-0 `vault`** — it's human-applied via OIDC; the CI `ci` role lacks Vault-admin perms (`sys/mounts`, `sys/policies/acl`) so a CI apply 403s |
| certbot | `.woodpecker/renew-tls.yml` | TLS renewal cron |
| drift-detection | `.woodpecker/drift-detection.yml` | Nightly Terraform drift (runs in `infra-ci`) |
| drift-detection | `.woodpecker/drift-detection.yml` | Nightly Terraform drift (runs in `infra-ci`). **Skips Tier-0 `vault`** (its `plan` 403s under the `ci` role and would fail the whole run) |
| provision-user | `.woodpecker/provision-user.yml` | Add namespace-owner user from Vault spec |
| registry-config-sync | `.woodpecker/registry-config-sync.yml` | SCP `modules/docker-registry/*``10.0.20.10` on change |
| pve-nfs-exports-sync | `.woodpecker/pve-nfs-exports-sync.yml` | Sync `scripts/pve-nfs-exports``/etc/exports` on PVE |