ci(infra): stop double-apply + stop counting PG lock-waits as failures
All checks were successful
ci/woodpecker/push/default Pipeline was successful

The infra terragrunt-apply pipeline (.woodpecker/default.yml) was going
red ~20% of the time. Root causes (verified from the failure logs, not
guessed):

1. infra is registered in Woodpecker TWICE — canonical Forgejo (repo 82)
   AND legacy GitHub mirror (repo 1) — and BOTH run `default.yml` on every
   push. The two applies race each other for the per-stack PG state lock →
   "Error acquiring the state lock" failures + push-supersede "killed" runs.
2. The skip-not-fail lock guard only matched the Tier-0 Vault lock string
   ("is locked by"); the Tier-1 PG-backend lock ("Error acquiring the state
   lock") fell through and was counted as a hard FAILURE.
3. Transient provider-registry download timeouts (and Vault 5xx) failed the
   whole pipeline with no retry.

Fixes (all in default.yml):
- Forge guard: the push-apply runs ONLY on the canonical Forgejo forge; on
  the GitHub mirror it no-ops (exit 0). The mirror keeps running the crons
  (they live on repo 1), so we de-dup the apply without deactivating the
  registration. Fail-open on unknown forge.
- Lock-skip now matches BOTH tiers (Vault + PG) → lock-waits are SKIPPED.
- Bounded retry (3x) ONLY on transient signatures (provider download
  timeout, Vault 5xx). Config errors + helm atomic-timeouts fail fast.

Rejected (documented in docs/architecture/ci-cd.md): an off-infra GHA
validate gate (catches ~0 of the real, runtime/Vault-data/SSA/lock
failures; reproduced `terraform validate` passing the exact stacks that
fail at apply) and lock-reaping/force-unlock (PG advisory locks are
session-scoped + auto-release; force-unlock can't free them and would
corrupt a live concurrent apply).

Shell logic + the classification regexes were unit-tested locally against
the real decoded error strings (#359 PG lock, #353 provider timeout, #360
missing-arg, helm atomic timeout); `bash -n` clean; YAML parses.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
Viktor Barzin 2026-06-28 11:37:18 +00:00
parent 69e35efd95
commit ec681ba6e1
2 changed files with 93 additions and 30 deletions

View file

@ -65,6 +65,21 @@ steps:
# don't need explicit token propagation.
VAULT_ADDR: http://vault-active.vault.svc.cluster.local:8200
commands:
# ── Forge guard: apply ONLY on the canonical Forgejo forge ──
# infra is registered in Woodpecker on BOTH the Forgejo canonical repo and
# the legacy GitHub mirror, and BOTH fire this push pipeline. Without this
# guard both run `terragrunt apply` on every push and race each other for
# the per-stack PG state lock — the dominant cause of the "Error acquiring
# the state lock" failures + push-supersede "killed" runs. The GitHub-mirror
# registration keeps running the CRONS (drift-detection, renew-tls, …) — only
# its duplicate push-apply no-ops here. Fail-open: an unknown forge (neither
# env var set) still applies, preserving prior behaviour.
- |
if echo "${CI_REPO_URL:-}${CI_FORGE_URL:-}" | grep -qi 'github\.com'; then
echo "[forge-guard] GitHub-mirror push — apply runs only on the Forgejo canonical repo (avoids double-apply + state-lock races). Skipping."
exit 0
fi
# ── Skip CI commits ──
- |
if echo "$CI_COMMIT_MESSAGE" | grep -q '\[CI SKIP\]\|\[ci skip\]'; then
@ -220,22 +235,33 @@ steps:
# (2026-06-27 — see docs/architecture/ci-cd.md)
if [ "$stack" = "vault" ]; then echo "[vault] SKIPPED (Tier-0, human-applied via OIDC)"; continue; fi
echo "[$stack] Starting apply..."
set +e
OUTPUT=$(cd "stacks/$stack" && ../../scripts/tg apply --non-interactive 2>&1)
EXIT=$?
set -e
if [ $EXIT -ne 0 ]; then
if echo "$OUTPUT" | grep -q "is locked by"; then
echo "[$stack] SKIPPED (locked by another session)"
else
echo "$OUTPUT" | tail -50
echo "[$stack] FAILED (exit $EXIT)"
FAILED_PLATFORM_STACKS="$FAILED_PLATFORM_STACKS $stack"
ATTEMPT=0
while :; do
ATTEMPT=$((ATTEMPT + 1))
set +e
OUTPUT=$(cd "stacks/$stack" && ../../scripts/tg apply --non-interactive 2>&1)
EXIT=$?
set -e
if [ $EXIT -eq 0 ]; then
echo "$OUTPUT" | tail -3; echo "[$stack] OK"; break
fi
else
echo "$OUTPUT" | tail -3
echo "[$stack] OK"
fi
# Lock contention → SKIP, not fail. Match BOTH the Tier-0 Vault lock
# ("is locked by", from scripts/tg) AND the Tier-1 PG-backend lock
# ("Error acquiring the state lock" / "already locked"). The PG case
# was previously counted as a failure — the #1 source of false reds.
if echo "$OUTPUT" | grep -qE 'is locked by|Error acquiring the state lock|already locked'; then
echo "[$stack] SKIPPED (locked by another session/run)"; break
fi
# Transient: provider-registry download timeout / Vault 5xx → bounded
# retry. Deliberately NOT helm atomic-timeouts or config errors
# (missing arg, invalid index) — those must fail fast, retry can't fix
# them and can worsen a stuck helm release.
if [ $ATTEMPT -lt 3 ] && echo "$OUTPUT" | grep -qE 'Failed to install provider|Client\.Timeout exceeded while awaiting headers|error reading from Vault.*Code: 5[0-9][0-9]'; then
echo "[$stack] transient error (attempt $ATTEMPT/3) — retrying in 15s..."; sleep 15; continue
fi
echo "$OUTPUT" | tail -50; echo "[$stack] FAILED (exit $EXIT)"
FAILED_PLATFORM_STACKS="$FAILED_PLATFORM_STACKS $stack"; break
done
done < .platform_apply
fi
# Deferred until after app stacks so both lists get a chance to run.
@ -248,22 +274,27 @@ steps:
echo "=== Applying app stacks (serial, locked) ==="
while read -r stack; do
echo "[$stack] Starting apply..."
set +e
OUTPUT=$(cd "stacks/$stack" && ../../scripts/tg apply --non-interactive 2>&1)
EXIT=$?
set -e
if [ $EXIT -ne 0 ]; then
if echo "$OUTPUT" | grep -q "is locked by"; then
echo "[$stack] SKIPPED (locked by another session)"
else
echo "$OUTPUT" | tail -50
echo "[$stack] FAILED (exit $EXIT)"
FAILED_APP_STACKS="$FAILED_APP_STACKS $stack"
ATTEMPT=0
while :; do
ATTEMPT=$((ATTEMPT + 1))
set +e
OUTPUT=$(cd "stacks/$stack" && ../../scripts/tg apply --non-interactive 2>&1)
EXIT=$?
set -e
if [ $EXIT -eq 0 ]; then
echo "$OUTPUT" | tail -3; echo "[$stack] OK"; break
fi
else
echo "$OUTPUT" | tail -3
echo "[$stack] OK"
fi
# Lock contention → SKIP, not fail (Tier-0 Vault + Tier-1 PG; see platform loop).
if echo "$OUTPUT" | grep -qE 'is locked by|Error acquiring the state lock|already locked'; then
echo "[$stack] SKIPPED (locked by another session/run)"; break
fi
# Transient provider-download / Vault 5xx → bounded retry (see platform loop).
if [ $ATTEMPT -lt 3 ] && echo "$OUTPUT" | grep -qE 'Failed to install provider|Client\.Timeout exceeded while awaiting headers|error reading from Vault.*Code: 5[0-9][0-9]'; then
echo "[$stack] transient error (attempt $ATTEMPT/3) — retrying in 15s..."; sleep 15; continue
fi
echo "$OUTPUT" | tail -50; echo "[$stack] FAILED (exit $EXIT)"
FAILED_APP_STACKS="$FAILED_APP_STACKS $stack"; break
done
done < .app_apply
fi
# Fail the step loudly so the pipeline `default` workflow state