|
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The infra terragrunt-apply pipeline (.woodpecker/default.yml) was going
red ~20% of the time. Root causes (verified from the failure logs, not
guessed):
1. infra is registered in Woodpecker TWICE — canonical Forgejo (repo 82)
AND legacy GitHub mirror (repo 1) — and BOTH run `default.yml` on every
push. The two applies race each other for the per-stack PG state lock →
"Error acquiring the state lock" failures + push-supersede "killed" runs.
2. The skip-not-fail lock guard only matched the Tier-0 Vault lock string
("is locked by"); the Tier-1 PG-backend lock ("Error acquiring the state
lock") fell through and was counted as a hard FAILURE.
3. Transient provider-registry download timeouts (and Vault 5xx) failed the
whole pipeline with no retry.
Fixes (all in default.yml):
- Forge guard: the push-apply runs ONLY on the canonical Forgejo forge; on
the GitHub mirror it no-ops (exit 0). The mirror keeps running the crons
(they live on repo 1), so we de-dup the apply without deactivating the
registration. Fail-open on unknown forge.
- Lock-skip now matches BOTH tiers (Vault + PG) → lock-waits are SKIPPED.
- Bounded retry (3x) ONLY on transient signatures (provider download
timeout, Vault 5xx). Config errors + helm atomic-timeouts fail fast.
Rejected (documented in docs/architecture/ci-cd.md): an off-infra GHA
validate gate (catches ~0 of the real, runtime/Vault-data/SSA/lock
failures; reproduced `terraform validate` passing the exact stacks that
fail at apply) and lock-reaping/force-unlock (PG advisory locks are
session-scoped + auto-release; force-unlock can't free them and would
corrupt a live concurrent apply).
Shell logic + the classification regexes were unit-tested locally against
the real decoded error strings (#359 PG lock, #353 provider timeout, #360
missing-arg, helm atomic timeout); `bash -n` clean; YAML parses.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
||
|---|---|---|
| .. | ||
| breakglass-infra-ci.yml | ||
| default.yml | ||
| drift-detection.yml | ||
| issue-automation.yml | ||
| postmortem-todos.yml | ||
| provision-user.yml | ||
| pve-nfs-exports-sync.yml | ||
| registry-config-sync.yml | ||
| renew-tls.yml | ||