ci: scripts/tg waits out a contended state lock (-lock-timeout)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The infra CI pipeline was failing often — ~38% of the last 50 runs didn't succeed. The single biggest cause (8 of 19 non-successes) was Tier-1 stack applies dying instantly with "Error acquiring the state lock". Tier-0 stacks already degrade gracefully (Vault advisory lock → the pipeline skips a locked stack). Tier-1 stacks have no such fallback: they rely on terraform's pg-backend pg_advisory_lock, and scripts/tg ran terragrunt with no -lock-timeout, so any concurrent lock holder was fatal — a Woodpecker-killed run whose PG lock wasn't reaped yet (PL266 killed → PL267 failed the same second), a human/agent applying locally, or the daily drift `plan`. Fix: scripts/tg now passes -lock-timeout (default 5m, override TG_LOCK_TIMEOUT) on every state-locking verb (plan/apply/destroy/refresh), so a contended lock WAITS for the holder to finish instead of failing. -auto-approve behaviour for non-interactive applies is unchanged. Central wrapper change → covers CI, plus local human/agent applies; no CI image rebuild (tg is read from the repo). Adds a hermetic pytest (stub terragrunt + preset PG_CONN_STR) pinning the arg-injection. Docs updated in AGENTS.md + .claude/CLAUDE.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
parent
9774ae3d19
commit
7bd4612edf
4 changed files with 129 additions and 17 deletions
|
|
@ -9,7 +9,7 @@
|
|||
- **Ask before `git push`** — always confirm with the user first
|
||||
|
||||
## Execution
|
||||
- **Apply a service**: `scripts/tg apply --non-interactive` (auto-decrypts SOPS secrets)
|
||||
- **Apply a service**: `scripts/tg apply --non-interactive` (auto-decrypts SOPS secrets; passes `-lock-timeout`, default `5m` / `TG_LOCK_TIMEOUT`, so a contended state lock waits instead of failing with `Error acquiring the state lock`)
|
||||
- **Legacy apply**: `cd stacks/<service> && terragrunt apply --non-interactive` (uses terraform.tfvars)
|
||||
- **kubectl**: `kubectl --kubeconfig $(pwd)/config`
|
||||
- **Health check**: `bash scripts/cluster_healthcheck.sh --quiet`
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue