ci(infra): stop double-apply + stop counting PG lock-waits as failures

The infra terragrunt-apply pipeline (.woodpecker/default.yml) was going
red ~20% of the time. Root causes (verified from the failure logs, not
guessed):

1. infra is registered in Woodpecker TWICE — canonical Forgejo (repo 82)
   AND legacy GitHub mirror (repo 1) — and BOTH run `default.yml` on every
   push. The two applies race each other for the per-stack PG state lock →
   "Error acquiring the state lock" failures + push-supersede "killed" runs.
2. The skip-not-fail lock guard only matched the Tier-0 Vault lock string
   ("is locked by"); the Tier-1 PG-backend lock ("Error acquiring the state
   lock") fell through and was counted as a hard FAILURE.
3. Transient provider-registry download timeouts (and Vault 5xx) failed the
   whole pipeline with no retry.

Fixes (all in default.yml):
- Forge guard: the push-apply runs ONLY on the canonical Forgejo forge; on
  the GitHub mirror it no-ops (exit 0). The mirror keeps running the crons
  (they live on repo 1), so we de-dup the apply without deactivating the
  registration. Fail-open on unknown forge.
- Lock-skip now matches BOTH tiers (Vault + PG) → lock-waits are SKIPPED.
- Bounded retry (3x) ONLY on transient signatures (provider download
  timeout, Vault 5xx). Config errors + helm atomic-timeouts fail fast.

Rejected (documented in docs/architecture/ci-cd.md):
- An off-infra GHA validate gate: it catches ~0 of the real failures
  (runtime, Vault-data, SSA, lock); `terraform validate` was reproduced
  passing the exact stacks that fail at apply.
- Lock-reaping/force-unlock: PG advisory locks are session-scoped and
  auto-release, so force-unlock can't free them and would corrupt a live
  concurrent apply.

Shell logic + the classification regexes were unit-tested locally against
the real decoded error strings (#359 PG lock, #353 provider timeout, #360
missing-arg, helm atomic timeout); `bash -n` clean; YAML parses.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor Barzin 2026-06-28 11:37:18 +00:00
parent 69e35efd95
commit ec681ba6e1
2 changed files with 93 additions and 30 deletions


@@ -234,6 +234,38 @@ Woodpecker is **deploy + cluster-touching steps only**:
**No build/test pipeline exists on any repo.** Do not (re)introduce one.

### `default.yml` apply: dual-registration de-dup + reliability (2026-06-28)

infra is registered in Woodpecker on **both** the canonical Forgejo repo (id 82)
and the legacy GitHub mirror (id 1), and **both fire `default.yml` on every
push**. Left unguarded, two `terragrunt apply` runs race each other for the
per-stack PG state lock — historically the #1 source of `Error acquiring the
state lock` failures and push-supersede "killed" runs.
- **Forge guard** (first command in the `apply` step): the push-apply runs **only
on the canonical Forgejo forge**; on the GitHub mirror it logs `[forge-guard]`
and `exit 0`s. Detection: if `CI_REPO_URL`/`CI_FORGE_URL` contains `github.com`,
the apply is skipped. Fail-open (unknown forge still applies). The mirror keeps running the
**crons** (drift-detection, renew-tls, …), which live on repo 1 — only its
duplicate push-apply no-ops. (Crons were NOT moved; deactivating repo 1 would
have killed them.)
- **Lock-skip matches both tiers**: a stack whose apply hits a lock is SKIPPED,
not failed. The grep now matches the Tier-0 Vault message (`is locked by`) **and**
the Tier-1 PG-backend message (`Error acquiring the state lock` / `already
locked`) — the PG case was previously miscounted as a hard failure.
- **Transient retry** (bounded, 3 attempts): only provider-registry download
timeouts (`Failed to install provider` / `Client.Timeout`) and Vault 5xx are
retried. Config errors (missing arg, invalid index) and helm `atomic` timeouts
are NOT retried — they fail fast.
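
The three behaviours above can be sketched in shell roughly as follows. This is a minimal illustration, not the code in `default.yml`: the function names (`should_apply_on_forge`, `classify_apply_output`, `apply_with_retry`) and the return-code convention are hypothetical, and the Vault 5xx pattern (`Code: 5xx`) is an assumption, since the exact string the pipeline matches is not stated here. A real retry loop would probably also sleep between attempts.

```shell
# Hypothetical sketch; names and return codes are illustrative only.

# Forge guard: return 1 (do not apply) when either URL points at github.com.
# Fail-open: an empty or unrecognised URL returns 0, so the apply still runs.
should_apply_on_forge() {
  case "${CI_REPO_URL:-} ${CI_FORGE_URL:-}" in
    *github.com*) return 1 ;;
    *) return 0 ;;
  esac
}

# Classify captured apply output as skip, retry, or fail.
classify_apply_output() {
  # Lock messages from both tiers (Vault and PG backend) mean "skip".
  if printf '%s\n' "$1" |
    grep -Eq 'is locked by|Error acquiring the state lock|already locked'; then
    echo skip
  # Transient signatures. 'Code: 5[0-9][0-9]' is an ASSUMED Vault 5xx pattern.
  elif printf '%s\n' "$1" |
    grep -Eq 'Failed to install provider|Client\.Timeout|Code: 5[0-9][0-9]'; then
    echo retry
  # Everything else (config errors, helm atomic timeouts) fails fast.
  else
    echo fail
  fi
}

# Run a command at most 3 times, retrying only on transient signatures.
# Returns 0 = applied, 2 = skipped on lock, 1 = failed.
apply_with_retry() {
  local attempt out
  for attempt in 1 2 3; do
    if out="$("$@" 2>&1)"; then
      return 0
    fi
    case "$(classify_apply_output "$out")" in
      skip)
        echo "[lock-skip] state is locked; skipping"
        return 2
        ;;
      retry)
        echo "[retry] transient failure on attempt $attempt/3"
        ;;
      *)
        printf '%s\n' "$out" >&2
        return 1
        ;;
    esac
  done
  # Still transient after 3 attempts: surface the last output and fail.
  printf '%s\n' "$out" >&2
  return 1
}
```

In a step shaped like this, the guard would run first (for example `should_apply_on_forge || { echo "[forge-guard] skipping"; exit 0; }`), and each stack's apply command would then be wrapped in `apply_with_retry`, with return code 2 counted as skipped rather than failed. Checking lock messages before transient ones keeps a locked stack from being retried.
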

A pre-apply off-infra validate gate was evaluated and rejected: `terraform
validate` runs without state but catches ~0 of the observed failures (they are
provider-config-from-Vault-data, server-side-apply conflicts, helm installs, and
lock contention, all invisible to static validate), and `plan` cannot run
off-infra (no Vault/PG access). `terragrunt apply` already fails at its plan
phase without mutating on config errors, so a separate in-pipeline plan-gate was
also dropped as redundant.

### Woodpecker API

Uses **numeric repo IDs** (`/api/repos/<id>/pipelines`), NOT owner/name paths