ci(infra): stop double-apply + stop counting PG lock-waits as failures

The infra terragrunt-apply pipeline (.woodpecker/default.yml) was going
red ~20% of the time. Root causes (verified from the failure logs, not
guessed):

1. infra is registered in Woodpecker TWICE, on the canonical Forgejo
   (repo 82) AND on the legacy GitHub mirror (repo 1), and BOTH run
   `default.yml` on every push. The two applies race each other for the
   per-stack PG state lock → "Error acquiring the state lock" failures +
   push-supersede "killed" runs.
2. The skip-not-fail lock guard only matched the Tier-0 Vault lock
   string ("is locked by"); the Tier-1 PG-backend lock ("Error acquiring
   the state lock") fell through and was counted as a hard FAILURE.
3. Transient provider-registry download timeouts (and Vault 5xx) failed
   the whole pipeline with no retry.

Fixes (all in default.yml):

- Forge guard: the push-apply runs ONLY on the canonical Forgejo forge;
  on the GitHub mirror it no-ops (exit 0). The mirror keeps running the
  crons (they live on repo 1), so we de-dup the apply without
  deactivating the registration. Fail-open on unknown forge.
- Lock-skip now matches BOTH tiers (Vault + PG) → lock-waits are SKIPPED.
- Bounded retry (3x) ONLY on transient signatures (provider download
  timeout, Vault 5xx). Config errors + helm atomic-timeouts fail fast.

Rejected (documented in docs/architecture/ci-cd.md): an off-infra GHA
validate gate (catches ~0 of the real, runtime/Vault-data/SSA/lock
failures; reproduced `terraform validate` passing the exact stacks that
fail at apply) and lock-reaping/force-unlock (PG advisory locks are
session-scoped + auto-release; force-unlock can't free them and would
corrupt a live concurrent apply).

Shell logic + the classification regexes were unit-tested locally
against the real decoded error strings (#359 PG lock, #353 provider
timeout, #360 missing-arg, helm atomic timeout); `bash -n` clean; YAML
parses.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
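
For readers without the diff to `default.yml` at hand, the three fixes described above can be sketched roughly as follows. This is an illustrative reconstruction, not the committed code: the function names (`is_mirror_forge`, `classify_apply_output`, `apply_with_retry`), the `MAX_ATTEMPTS` variable, and the return codes are all hypothetical. The match strings are only the ones quoted in this commit and its docs; the Vault 5xx signature is not quoted anywhere here, so it is left out rather than guessed.

```shell
# Sketch only: hypothetical helpers, not the contents of default.yml.

# Forge guard. Returns 0 when the pipeline is running from the GitHub
# mirror. Fail-open: an unknown or empty forge URL returns 1 (apply runs).
is_mirror_forge() {
  case "${CI_REPO_URL:-} ${CI_FORGE_URL:-}" in
    *github.com*) return 0 ;;
    *) return 1 ;;
  esac
}

# Classify captured apply output as: lock | transient | fatal.
# Anything unmatched (config errors, helm atomic timeouts) is "fatal".
classify_apply_output() {
  if printf '%s\n' "$1" |
    grep -E 'is locked by|Error acquiring the state lock|already locked' >/dev/null; then
    echo lock
  elif printf '%s\n' "$1" |
    grep -E 'Failed to install provider|Client\.Timeout' >/dev/null; then
    echo transient
  else
    echo fatal
  fi
}

# Run "$@" up to MAX_ATTEMPTS times (default 3), retrying only when the
# output classifies as transient.
# Returns: 0 = applied, 2 = skipped because of a lock, 1 = failed.
apply_with_retry() {
  _max="${MAX_ATTEMPTS:-3}"
  _attempt=1
  while :; do
    _rc=0
    _out="$("$@" 2>&1)" || _rc=$?
    printf '%s\n' "$_out"
    if [ "$_rc" -eq 0 ]; then return 0; fi
    case "$(classify_apply_output "$_out")" in
      lock)
        echo "[lock-skip] state lock held elsewhere; skipping stack"
        return 2
        ;;
      transient)
        if [ "$_attempt" -ge "$_max" ]; then return 1; fi
        _attempt=$((_attempt + 1))
        ;;
      *) return 1 ;;
    esac
  done
}
```

Under these assumptions, the apply step would start with something like `is_mirror_forge && { echo "[forge-guard] mirror: skipping push-apply"; exit 0; }` and then call `apply_with_retry terragrunt apply ...` per stack, counting return code 2 as a skip rather than a failure.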
2026-06-28 11:37:18 +00:00 · ec681ba6e1
commit ec681ba6e1
parent 69e35efd95
2 changed files with 93 additions and 30 deletions
--- a/docs/architecture/ci-cd.md
+++ b/docs/architecture/ci-cd.md
@@ -234,6 +234,38 @@ Woodpecker is **deploy + cluster-touching steps only**:

 **No build/test pipeline exists on any repo.** Do not (re)introduce one.

+### `default.yml` apply: dual-registration de-dup + reliability (2026-06-28)
+
+infra is registered in Woodpecker on **both** the canonical Forgejo repo (id 82)
+and the legacy GitHub mirror (id 1), and **both fire `default.yml` on every
+push**. Left unguarded, two `terragrunt apply` runs race each other for the
+per-stack PG state lock — historically the #1 source of `Error acquiring the
+state lock` failures and push-supersede "killed" runs.
+
+- **Forge guard** (first command in the `apply` step): the push-apply runs **only
+  on the canonical Forgejo forge**; on the GitHub mirror it logs `[forge-guard]`
+  and `exit 0`s. Detection: `CI_REPO_URL`/`CI_FORGE_URL` contains `github.com` →
+  skip. Fail-open (unknown forge still applies). The mirror keeps running the
+  **crons** (drift-detection, renew-tls, …), which live on repo 1 — only its
+  duplicate push-apply no-ops. (Crons were NOT moved; deactivating repo 1 would
+  have killed them.)
+- **Lock-skip matches both tiers**: a stack whose apply hits a lock is SKIPPED,
+  not failed. The grep now matches the Tier-0 Vault message (`is locked by`) **and**
+  the Tier-1 PG-backend message (`Error acquiring the state lock` / `already
+  locked`) — the PG case was previously miscounted as a hard failure.
+- **Transient retry** (bounded, 3 attempts): only provider-registry download
+  timeouts (`Failed to install provider` / `Client.Timeout`) and Vault 5xx are
+  retried. Config errors (missing arg, invalid index) and helm `atomic` timeouts
+  are NOT retried — they fail fast.
+
+A pre-apply off-infra validate gate was evaluated and rejected: `terraform
+validate` runs without state but catches ~0 of the observed failures (they are
+provider-config-from-Vault-data, server-side-apply conflicts, helm installs, and
+lock contention — all invisible to static validate), and `plan` cannot run
+off-infra (no Vault/PG access). `terragrunt apply` already fails at its plan
+phase without mutating on config errors, so a separate in-pipeline plan-gate was
+also dropped as redundant.
+
 ### Woodpecker API

 Uses **numeric repo IDs** (`/api/repos/<id>/pipelines`), NOT owner/name paths