ci(infra): stop double-apply + stop counting PG lock-waits as failures
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The infra terragrunt-apply pipeline (.woodpecker/default.yml) was going
red ~20% of the time. Root causes (verified from the failure logs, not
guessed):
1. infra is registered in Woodpecker TWICE — canonical Forgejo (repo 82)
AND legacy GitHub mirror (repo 1) — and BOTH run `default.yml` on every
push. The two applies race each other for the per-stack PG state lock →
"Error acquiring the state lock" failures + push-supersede "killed" runs.
2. The skip-not-fail lock guard only matched the Tier-0 Vault lock string
("is locked by"); the Tier-1 PG-backend lock ("Error acquiring the state
lock") fell through and was counted as a hard FAILURE.
3. Transient provider-registry download timeouts (and Vault 5xx) failed the
whole pipeline with no retry.
Fixes (all in default.yml):
- Forge guard: the push-apply runs ONLY on the canonical Forgejo forge; on
the GitHub mirror it no-ops (exit 0). The mirror keeps running the crons
(they live on repo 1), so we de-dup the apply without deactivating the
registration. Fail-open on unknown forge.
- Lock-skip now matches BOTH tiers (Vault + PG) → lock-waits are SKIPPED.
- Bounded retry (3x) ONLY on transient signatures (provider download
timeout, Vault 5xx). Config errors + helm atomic-timeouts fail fast.
Rejected (documented in docs/architecture/ci-cd.md): an off-infra GHA
validate gate (catches ~0 of the real failures, which are runtime,
Vault-data, SSA, and lock errors; reproduced `terraform validate` passing
the exact stacks that fail at apply) and lock-reaping/force-unlock (PG
advisory locks are session-scoped and auto-release; force-unlock can't
free them and would corrupt a live concurrent apply).
Shell logic + the classification regexes were unit-tested locally against
the real decoded error strings (#359 PG lock, #353 provider timeout, #360
missing-arg, helm atomic timeout); `bash -n` clean; YAML parses.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
parent 69e35efd95
commit ec681ba6e1
2 changed files with 93 additions and 30 deletions
@@ -234,6 +234,38 @@ Woodpecker is **deploy + cluster-touching steps only**:

**No build/test pipeline exists on any repo.** Do not (re)introduce one.

### `default.yml` apply: dual-registration de-dup + reliability (2026-06-28)

infra is registered in Woodpecker on **both** the canonical Forgejo repo (id 82)
and the legacy GitHub mirror (id 1), and **both fire `default.yml` on every
push**. Left unguarded, two `terragrunt apply` runs race each other for the
per-stack PG state lock, historically the #1 source of `Error acquiring the
state lock` failures and push-supersede "killed" runs.

- **Forge guard** (first command in the `apply` step): the push-apply runs **only
  on the canonical Forgejo forge**; on the GitHub mirror it logs `[forge-guard]`
  and `exit 0`s. Detection: `CI_REPO_URL`/`CI_FORGE_URL` contains `github.com` →
  skip. Fail-open (unknown forge still applies). The mirror keeps running the
  **crons** (drift-detection, renew-tls, …), which live on repo 1; only its
  duplicate push-apply no-ops. (Crons were NOT moved; deactivating repo 1 would
  have killed them.)
- **Lock-skip matches both tiers**: a stack whose apply hits a lock is SKIPPED,
  not failed. The grep now matches the Tier-0 Vault message (`is locked by`) **and**
  the Tier-1 PG-backend message (`Error acquiring the state lock` / `already
  locked`); the PG case was previously miscounted as a hard failure.
- **Transient retry** (bounded, 3 attempts): only provider-registry download
  timeouts (`Failed to install provider` / `Client.Timeout`) and Vault 5xx are
  retried. Config errors (missing arg, invalid index) and helm `atomic` timeouts
  are NOT retried; they fail fast.
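
The three behaviours above can be sketched in shell roughly as follows. This is an illustration, not the contents of `default.yml`: the function names, the `RETRY_SLEEP` variable, and the Vault 5xx pattern (`Code: 5xx`, modelled on the usual Vault client error text) are assumptions. The lock and provider-timeout strings are the ones quoted above.

```bash
#!/usr/bin/env bash
# Sketch only: names and patterns marked "assumed" are not taken from default.yml.

# Forge guard. Returns 1 (skip) only when the forge is clearly the GitHub
# mirror; any other or unknown forge returns 0 (fail-open, apply proceeds).
should_apply_on_forge() {
  case "${CI_REPO_URL:-} ${CI_FORGE_URL:-}" in
    *github.com*)
      echo "[forge-guard] GitHub mirror detected, skipping push-apply"
      return 1 ;;
    *) return 0 ;;
  esac
}

# Classify captured apply output as: skip | retry | fail.
classify_apply_output() {
  if printf '%s\n' "$1" | grep -qE 'is locked by|Error acquiring the state lock|already locked'; then
    echo skip      # Tier-0 Vault lock or Tier-1 PG-backend lock
  elif printf '%s\n' "$1" | grep -qE 'Failed to install provider|Client\.Timeout|Code: 5[0-9]{2}'; then
    echo retry     # transient; the "Code: 5xx" Vault pattern is assumed
  else
    echo fail      # config errors, helm atomic timeouts, anything else
  fi
}

# Run "$@" up to 3 times. Prints applied | skipped | failed.
# RETRY_SLEEP (seconds between attempts) is a hypothetical knob.
apply_with_retry() {
  local attempt out
  for attempt in 1 2 3; do
    if out="$("$@" 2>&1)"; then
      echo applied
      return 0
    fi
    case "$(classify_apply_output "$out")" in
      skip)  echo skipped; return 0 ;;
      retry) if [ "$attempt" -lt 3 ]; then sleep "${RETRY_SLEEP:-5}"; fi ;;
      *)     echo failed; return 1 ;;
    esac
  done
  echo failed   # transient error persisted through all 3 attempts
  return 1
}
```

In a step, this could be wired as `should_apply_on_forge || exit 0` followed by `apply_with_retry terragrunt apply` per stack, with `skipped` counted separately from `failed`. Checking the lock patterns before the transient ones means a lock-wait is never retried, only skipped.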

A pre-apply off-infra validate gate was evaluated and rejected: `terraform
validate` runs without state but catches ~0 of the observed failures (they are
provider-config-from-Vault-data, server-side-apply conflicts, helm installs, and
lock contention, all invisible to static validate), and `plan` cannot run
off-infra (no Vault/PG access). On config errors, `terragrunt apply` already
fails at its plan phase without mutating anything, so a separate in-pipeline
plan-gate was also dropped as redundant.

### Woodpecker API

Uses **numeric repo IDs** (`/api/repos/<id>/pipelines`), NOT owner/name paths
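
As a small illustration of the numeric-ID convention, the helper below builds the pipelines URL and rejects anything that is not a plain number. The function name, the example host, and the use of `WOODPECKER_SERVER` / `WOODPECKER_TOKEN` as variable names are assumptions made for this sketch.

```bash
#!/usr/bin/env bash
# Sketch only: pipelines_url is a hypothetical helper, not part of Woodpecker.

# Build the pipelines endpoint for a numeric repo id; reject owner/name paths.
pipelines_url() {
  local base="${1%/}" repo_id="$2"   # strip one trailing slash from the base URL
  case "$repo_id" in
    ''|*[!0-9]*)
      echo "repo id must be numeric, got: $repo_id" >&2
      return 1 ;;
  esac
  printf '%s/api/repos/%s/pipelines\n' "$base" "$repo_id"
}

# Example call (not executed here), assuming a bearer token in WOODPECKER_TOKEN:
#   curl -fsS -H "Authorization: Bearer $WOODPECKER_TOKEN" \
#     "$(pipelines_url "$WOODPECKER_SERVER" 82)"
```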