ci: scripts/tg waits out a contended state lock (-lock-timeout)

The infra CI pipeline was failing often: 19 of the last 50 runs (38%) did not
succeed. The single biggest cause (8 of those 19 non-successes) was Tier-1 stack
applies dying instantly with "Error acquiring the state lock".

Tier-0 stacks already degrade gracefully (Vault advisory lock → the pipeline
skips a locked stack). Tier-1 stacks have no such fallback: they rely on
terraform's pg-backend pg_advisory_lock, and scripts/tg ran terragrunt with
no -lock-timeout, so any concurrent lock holder was fatal — a Woodpecker-killed
run whose PG lock wasn't reaped yet (PL266 killed → PL267 failed the same
second), a human/agent applying locally, or the daily drift `plan`.

Fix: scripts/tg now passes -lock-timeout (default 5m, override TG_LOCK_TIMEOUT)
on every state-locking verb (plan/apply/destroy/refresh), so a contended lock
WAITS for the holder to finish instead of failing. -auto-approve behaviour for
non-interactive applies is unchanged. Central wrapper change → covers CI, plus
local human/agent applies; no CI image rebuild (tg is read from the repo).
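
A minimal sketch of the injection rule described above, assuming the set of
state-locking verbs is exactly plan/apply/destroy/refresh. `build_tg_args` is a
hypothetical helper, not a function in scripts/tg; it prints the final argument
list one per line instead of invoking terragrunt, so the rule can be inspected
on its own.

```shell
#!/usr/bin/env sh
# Sketch only: mirrors the arg-building rule from this commit message,
# not the real scripts/tg.
LOCK_TIMEOUT="${TG_LOCK_TIMEOUT:-5m}"

# build_tg_args (hypothetical): print the arguments terragrunt would receive,
# one per line.
build_tg_args() {
  has_non_interactive=false
  is_tf_op=false
  # First pass: detect the flags that change what gets injected.
  for arg in "$@"; do
    case "$arg" in
      --non-interactive) has_non_interactive=true ;;
      plan|apply|destroy|refresh) is_tf_op=true ;;  # assumed locking verbs
    esac
  done
  # Second pass: echo each argument, adding -auto-approve right after `apply`.
  for arg in "$@"; do
    printf '%s\n' "$arg"
    if [ "$arg" = "apply" ] && $has_non_interactive; then
      printf '%s\n' "-auto-approve"
    fi
  done
  # State-locking verbs get the lock wait window appended last.
  if $is_tf_op; then
    printf '%s\n' "-lock-timeout=$LOCK_TIMEOUT"
  fi
}
```

With this sketch, `build_tg_args init` prints only `init`, while a locking verb
such as `plan` gains a trailing `-lock-timeout=...` argument.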

Adds a hermetic pytest (stub terragrunt + preset PG_CONN_STR) pinning the
arg-injection. Docs updated in AGENTS.md + .claude/CLAUDE.md.
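
The stub-terragrunt idea can be sketched in shell as follows. This is an
illustration of the technique, not the pytest added by this commit:
`tg-standin` is a hypothetical stand-in for the wrapper, `TG_STUB_LOG` is an
invented variable, and the preset PG_CONN_STR is omitted because the stand-in
never touches a backend.

```shell
#!/usr/bin/env sh
# Sketch only: a stand-in wrapper is exercised here, not the real scripts/tg.
workdir="$(mktemp -d)"

# Stub terragrunt: record each received argument on its own line, do nothing else.
cat > "$workdir/terragrunt" <<'EOF'
#!/bin/sh
printf '%s\n' "$@" > "$TG_STUB_LOG"
EOF
chmod +x "$workdir/terragrunt"

# Hypothetical stand-in for the wrapper: appends -lock-timeout to its arguments.
cat > "$workdir/tg-standin" <<'EOF'
#!/bin/sh
exec terragrunt "$@" "-lock-timeout=${TG_LOCK_TIMEOUT:-5m}"
EOF
chmod +x "$workdir/tg-standin"

# Put the stub first on PATH so no real terragrunt or state backend is reached.
TG_STUB_LOG="$workdir/args.log" TG_LOCK_TIMEOUT=30s PATH="$workdir:$PATH" \
  "$workdir/tg-standin" plan

# Collect what the stub saw, space-separated, then clean up.
recorded="$(tr '\n' ' ' < "$workdir/args.log")"
echo "$recorded"
rm -rf "$workdir"
```

Because the stub only records its arguments, the check stays hermetic: it pins
what the wrapper passes along without needing terraform, terragrunt, or a
database.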

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor Barzin 2026-06-21 00:15:39 +00:00
parent 9774ae3d19
commit 7bd4612edf
4 changed files with 129 additions and 17 deletions


@@ -13,6 +13,15 @@ export TF_PLUGIN_CACHE_DIR="${TF_PLUGIN_CACHE_DIR:-$HOME/.terraform.d/plugin-cac
 export TF_PLUGIN_CACHE_MAY_BREAK_DEPENDENCY_LOCK_FILE=1
 mkdir -p "$TF_PLUGIN_CACHE_DIR"
+# State-lock wait window. Tier-1 stacks lock their state via terraform's pg
+# backend (pg_advisory_lock); with no timeout an apply fails instantly
+# ("Error acquiring the state lock") the moment anything else holds the lock —
+# a Woodpecker-killed run whose lock PG hasn't reaped yet, a concurrent local
+# apply, or the daily drift `plan`. Waiting a few minutes absorbs all of those
+# (the holder finishes, or PG reaps the dead backend). This was the #1 cause of
+# infra CI failures. Override with TG_LOCK_TIMEOUT (e.g. 0 to fail fast).
+LOCK_TIMEOUT="${TG_LOCK_TIMEOUT:-5m}"
 # Determine stack name from cwd (relative to stacks/)
 STACK_NAME=""
 cwd="$(pwd)"
@@ -134,29 +143,30 @@ if $is_mutating && [ -n "$STACK_NAME" ] && is_tier0 "$STACK_NAME"; then
   fi
 fi
-# If running apply with --non-interactive, add -auto-approve for Terraform
+# Build the terragrunt invocation:
+# - add -auto-approve right after `apply` for --non-interactive runs (CI)
+# - add -lock-timeout for state-locking verbs (plan/apply/destroy/refresh) so
+# a contended state lock WAITS instead of failing instantly (see
+# LOCK_TIMEOUT above). Non-locking verbs (init/validate/output/fmt) skip it.
 args=("$@")
 has_apply=false
 has_non_interactive=false
 for arg in "${args[@]}"; do
   case "$arg" in
     apply) has_apply=true ;;
     --non-interactive) has_non_interactive=true ;;
   esac
 done
-if $has_apply && $has_non_interactive; then
-  new_args=()
-  for arg in "${args[@]}"; do
-    new_args+=("$arg")
-    if [ "$arg" = "apply" ]; then
-      new_args+=("-auto-approve")
-    fi
-  done
-  terragrunt "${new_args[@]}"
-else
-  terragrunt "$@"
+tg_args=()
+for arg in "${args[@]}"; do
+  tg_args+=("$arg")
+  if [ "$arg" = "apply" ] && $has_non_interactive; then
+    tg_args+=("-auto-approve")
+  fi
+done
+if $is_tf_op; then
+  tg_args+=("-lock-timeout=$LOCK_TIMEOUT")
 fi
+terragrunt "${tg_args[@]}"
 # After mutating operations: encrypt+commit (Tier 0) or no-op (Tier 1 — PG is authoritative)
 if $is_mutating && [ -n "$STACK_NAME" ] && is_tier0 "$STACK_NAME"; then