ci: scripts/tg waits out a contended state lock (-lock-timeout)

The infra CI pipeline was failing often: ~38% of the last 50 runs (19) didn't succeed. The single biggest cause (8 of those 19 non-successes) was Tier-1 stack applies dying instantly with "Error acquiring the state lock".

Tier-0 stacks already degrade gracefully (Vault advisory lock → the pipeline skips a locked stack). Tier-1 stacks have no such fallback: they rely on terraform's pg-backend pg_advisory_lock, and scripts/tg ran terragrunt with no -lock-timeout, so any concurrent lock holder was fatal. That holder could be:
- a Woodpecker-killed run whose PG lock wasn't reaped yet (PL266 killed → PL267 failed the same second)
- a human/agent applying locally
- the daily drift `plan`

Fix: scripts/tg now passes -lock-timeout (default 5m, override via TG_LOCK_TIMEOUT) on every state-locking verb (plan/apply/destroy/refresh), so a contended lock waits for the holder to finish instead of failing. -auto-approve behaviour for non-interactive applies is unchanged.

Central wrapper change → covers CI plus local human/agent applies; no CI image rebuild needed (tg is read from the repo).

Adds a hermetic pytest (stub terragrunt + preset PG_CONN_STR) pinning the arg-injection. Docs updated in AGENTS.md + .claude/CLAUDE.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
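
The arg-injection described in the commit message can be sketched roughly as below. This is an illustrative Python sketch, not the actual `scripts/tg` code (which is not shown in this diff): the function name `inject_lock_timeout`, the "first non-flag argument is the verb" rule, and the choice to leave an explicit caller-supplied `-lock-timeout` untouched are all assumptions. Only the verb list, the `5m` default, and the `TG_LOCK_TIMEOUT` override come from the commit message.

```python
import os

# Verbs that take the state lock, as listed in the commit message.
STATE_LOCKING_VERBS = {"plan", "apply", "destroy", "refresh"}
DEFAULT_LOCK_TIMEOUT = "5m"


def inject_lock_timeout(args, env=None):
    """Return a copy of args with -lock-timeout appended for state-locking verbs.

    Hypothetical helper: assumes the verb is the first argument that does not
    start with '-', which ignores flags that take a separate value.
    """
    env = os.environ if env is None else env
    verb = next((a for a in args if not a.startswith("-")), None)
    if verb not in STATE_LOCKING_VERBS:
        return list(args)  # e.g. `output` or `validate`: no state lock involved
    if any(a.startswith("-lock-timeout") for a in args):
        return list(args)  # assumption: an explicit caller-supplied value wins
    timeout = env.get("TG_LOCK_TIMEOUT") or DEFAULT_LOCK_TIMEOUT
    return list(args) + [f"-lock-timeout={timeout}"]
```

Under these assumptions, `inject_lock_timeout(["plan"], env={})` returns `["plan", "-lock-timeout=5m"]`, and setting `TG_LOCK_TIMEOUT=10m` in the environment changes the appended value to `-lock-timeout=10m`. Passing `env` explicitly keeps the function testable without touching the real environment, in the same spirit as the hermetic pytest the commit adds.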
2026-06-21 00:15:39 +00:00 · 2026-06-21 00:15:39 +00:00 · 7bd4612edf
commit 7bd4612edf
parent 9774ae3d19
4 changed files with 129 additions and 17 deletions
--- a/.claude/CLAUDE.md
+++ b/.claude/CLAUDE.md
@@ -25,7 +25,7 @@ Violations cause state drift, which causes future applies to break or silently r

 ## Instructions
 - **"remember X"**: Use `memory-tool store "content" --category facts --tags "tag1,tag2"` (via exec) for persistent cross-session memory. Also update this file + `AGENTS.md` (if shared knowledge), commit with `[ci skip]`. To recall: `memory-tool recall "query"`. To list: `memory-tool list`. To delete: `memory-tool delete <id>`. The native `memory_search` and `memory_get` tools are also available for searching indexed memory files. For **storing** new memories, always use the `memory-tool` CLI via exec.
-- **Apply**: Authenticate via `vault login -method=oidc`, then use `scripts/tg` (preferred — handles state decrypt/encrypt) or `terragrunt` directly. `scripts/tg` adds `-auto-approve` for `--non-interactive` applies.
+- **Apply**: Authenticate via `vault login -method=oidc`, then use `scripts/tg` (preferred — handles state decrypt/encrypt) or `terragrunt` directly. `scripts/tg` adds `-auto-approve` for `--non-interactive` applies, and `-lock-timeout` (default `5m`, override via `TG_LOCK_TIMEOUT`) on every state-locking verb (`plan`/`apply`/`destroy`/`refresh`) so a contended state lock **waits** instead of failing instantly with `Error acquiring the state lock`.
 - **New services need CI/CD** and **monitoring** (Prometheus/Uptime Kuma). CI = a GHA workflow on the repo's GitHub mirror (build + tests off-infra, ADR-0002); Woodpecker gets a deploy-only pipeline — never an in-cluster build.
 - **New service**: Use `setup-project` skill for full workflow
 - **Ingress**: `ingress_factory` module. **Auth** (`auth` string enum, default `"required"` — fail-closed). Pick by asking "what gates the app?":
@@ -47,7 +47,7 @@ Violations cause state drift, which causes future applies to break or silently r

 ## Terraform State — Two-Tier Backend
 - **Tier 0 (bootstrap)**: Local state, SOPS-encrypted in git. Stacks: `infra`, `platform`, `cnpg`, `vault`, `dbaas`, `external-secrets`. These must exist before PG is reachable.
-- **Tier 1 (everything else)**: PostgreSQL backend (`pg`) on CNPG cluster at `pg-cluster-rw.dbaas.svc.cluster.local:5432/terraform_state`. Native `pg_advisory_lock` for concurrent safety. Each stack gets its own PG schema.
+- **Tier 1 (everything else)**: PostgreSQL backend (`pg`) on CNPG cluster at `pg-cluster-rw.dbaas.svc.cluster.local:5432/terraform_state`. Native `pg_advisory_lock` for concurrent safety. Each stack gets its own PG schema. **Lock contention is non-fatal**: `scripts/tg` passes `-lock-timeout` (default `5m`) so a contended lock waits rather than hard-failing — this was the #1 cause of infra CI failures (a Woodpecker-killed run's unreaped PG lock, a concurrent local apply, or the daily drift `plan`; Tier-1 stacks have no Vault advisory-lock skip to fall back on, unlike Tier-0).
 - **Auth**: `scripts/tg` auto-fetches PG credentials from Vault (`database/static-creds/pg-terraform-state`). Humans use `vault login -method=oidc`, agents use K8s auth (role: `terraform-state`, namespace: `claude-agent`).
 - **Tier 0 workflow** (unchanged): `git pull` → `scripts/tg plan` → `scripts/tg apply` → `git push`. State sync via SOPS is transparent.
 - **Tier 1 workflow**: `vault login -method=oidc` → `scripts/tg plan` → `scripts/tg apply`. No git commit needed — PG is authoritative.