infra/.woodpecker
Viktor Barzin ec681ba6e1
ci(infra): stop double-apply + stop counting PG lock-waits as failures
The infra terragrunt-apply pipeline (.woodpecker/default.yml) was going
red ~20% of the time. Root causes (verified from the failure logs, not
guessed):

1. infra is registered in Woodpecker TWICE — canonical Forgejo (repo 82)
   AND legacy GitHub mirror (repo 1) — and BOTH run `default.yml` on every
   push. The two applies race each other for the per-stack PG state lock →
   "Error acquiring the state lock" failures + push-supersede "killed" runs.
2. The skip-not-fail lock guard only matched the Tier-0 Vault lock string
   ("is locked by"); the Tier-1 PG-backend lock ("Error acquiring the state
   lock") fell through and was counted as a hard FAILURE.
3. Transient provider-registry download timeouts (and Vault 5xx) failed the
   whole pipeline with no retry.

Fixes (all in default.yml):
- Forge guard: the push-apply runs ONLY on the canonical Forgejo forge; on
  the GitHub mirror it no-ops (exit 0). The mirror keeps running the crons
  (they live on repo 1), so we de-dup the apply without deactivating the
  registration. Fail-open on unknown forge.
- Lock-skip now matches BOTH tiers (Vault + PG) → lock-waits are SKIPPED.
- Bounded retry (3x) ONLY on transient signatures (provider download
  timeout, Vault 5xx). Config errors + helm atomic-timeouts fail fast.
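
A minimal shell sketch of how the three fixes above could fit together; it is not the code in default.yml. The two lock strings are the ones quoted in this message. Everything else is an assumption for illustration: the function names, the `case` globs used in place of the real classification regexes, the transient patterns (placeholder wording, not real log signatures), and the idea that the forge name arrives as an argument (for example from a variable such as Woodpecker's `CI_FORGE_TYPE`).

```shell
# Forge guard: no-op on the GitHub mirror, apply everywhere else.
# An unknown or empty forge name falls through to "apply" (fail-open).
should_apply_on_forge() {
  case "${1:-}" in
    github) return 1 ;;  # legacy mirror: skip the push-apply
    *)      return 0 ;;  # forgejo, or unknown: run the apply
  esac
}

# Classify one apply log as skip / retry / fail.
classify_apply_log() {
  case "$1" in
    # Tier-0 Vault lock or Tier-1 PG-backend lock: another apply holds it.
    *"is locked by"*|*"Error acquiring the state lock"*) echo skip ;;
    # ASSUMED placeholder patterns for the transient signatures.
    *"provider download timeout"*|*"Vault 5"[0-9][0-9]*) echo retry ;;
    # Config errors, helm atomic-timeouts, anything else: fail fast.
    *) echo fail ;;
  esac
}

# Run a command at most $1 times; only "retry" verdicts are retried.
run_with_retry() {
  local max="$1" attempt=1 log rc
  shift
  while :; do
    rc=0
    log="$("$@" 2>&1)" || rc=$?
    if [ "$rc" -eq 0 ]; then
      printf '%s\n' "$log"
      return 0
    fi
    case "$(classify_apply_log "$log")" in
      skip)  echo "state lock held elsewhere; skipping"; return 0 ;;
      retry) [ "$attempt" -lt "$max" ] || return "$rc"
             attempt=$((attempt + 1)) ;;
      *)     printf '%s\n' "$log"; return "$rc" ;;
    esac
  done
}
```

Under these assumptions a step would first call `should_apply_on_forge` and exit 0 when it returns non-zero, then wrap the apply command in `run_with_retry 3`. Treating a lock-wait as exit 0 is what keeps it from being counted as a failure, while any unclassified error returns the original exit code on the first attempt.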

Rejected (documented in docs/architecture/ci-cd.md):
- An off-infra GHA validate gate. It catches ~0 of the real failures,
  which are runtime/Vault-data/SSA/lock failures; we reproduced
  `terraform validate` passing the exact stacks that fail at apply.
- Lock-reaping/force-unlock. PG advisory locks are session-scoped and
  auto-release, so force-unlock can't free them and would corrupt a live
  concurrent apply.

Shell logic + the classification regexes were unit-tested locally against
the real decoded error strings (#359 PG lock, #353 provider timeout, #360
missing-arg, helm atomic timeout); `bash -n` clean; YAML parses.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 11:37:18 +00:00
breakglass-infra-ci.yml ci: retire in-cluster infra-ci build; breakglass becomes manual ghcr pull-and-save (ADR-0002 #30) 2026-06-13 10:07:58 +00:00
default.yml ci(infra): stop double-apply + stop counting PG lock-waits as failures 2026-06-28 11:37:18 +00:00
drift-detection.yml ci(woodpecker): stop applying/planning the Tier-0 vault stack in CI 2026-06-27 15:48:41 +00:00
issue-automation.yml woodpecker: disable partial clone (partial: false) — fix intermittent git exit-128 2026-06-19 09:06:44 +00:00
postmortem-todos.yml woodpecker: disable partial clone (partial: false) — fix intermittent git exit-128 2026-06-19 09:06:44 +00:00
provision-user.yml woodpecker: disable partial clone (partial: false) — fix intermittent git exit-128 2026-06-19 09:06:44 +00:00
pve-nfs-exports-sync.yml woodpecker: disable partial clone (partial: false) — fix intermittent git exit-128 2026-06-19 09:06:44 +00:00
registry-config-sync.yml woodpecker: disable partial clone (partial: false) — fix intermittent git exit-128 2026-06-19 09:06:44 +00:00
renew-tls.yml woodpecker: disable partial clone (partial: false) — fix intermittent git exit-128 2026-06-19 09:06:44 +00:00