ci: scripts/tg waits out a contended state lock (-lock-timeout)

The infra CI pipeline was failing often — ~38% of the last 50 runs didn't succeed. The single biggest cause (8 of 19 non-successes) was Tier-1 stack applies dying instantly with "Error acquiring the state lock". Tier-0 stacks already degrade gracefully (Vault advisory lock → the pipeline skips a locked stack). Tier-1 stacks have no such fallback: they rely on terraform's pg-backend pg_advisory_lock, and scripts/tg ran terragrunt with no -lock-timeout, so any concurrent lock holder was fatal — a Woodpecker-killed run whose PG lock wasn't reaped yet (PL266 killed → PL267 failed the same second), a human/agent applying locally, or the daily drift `plan`. Fix: scripts/tg now passes -lock-timeout (default 5m, override TG_LOCK_TIMEOUT) on every state-locking verb (plan/apply/destroy/refresh), so a contended lock WAITS for the holder to finish instead of failing. -auto-approve behaviour for non-interactive applies is unchanged. Central wrapper change → covers CI, plus local human/agent applies; no CI image rebuild (tg is read from the repo). Adds a hermetic pytest (stub terragrunt + preset PG_CONN_STR) pinning the arg-injection. Docs updated in AGENTS.md + .claude/CLAUDE.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 00:15:39 +00:00 · 2026-06-21 00:15:39 +00:00 · 7bd4612edf
commit 7bd4612edf
parent 9774ae3d19
4 changed files with 129 additions and 17 deletions
--- a/scripts/test_tg_lock_timeout.py
+++ b/scripts/test_tg_lock_timeout.py
@ -0,0 +1,102 @@
+#!/usr/bin/env python3
+"""Tests for scripts/tg lock-timeout injection.
+
+scripts/tg wraps terragrunt. Tier-1 stacks rely on terraform's pg-backend
+state lock; without -lock-timeout an apply fails instantly ("Error acquiring
+the state lock") whenever anything else holds the lock — a Woodpecker-killed
+run whose PG advisory lock has not been reaped yet, a concurrent local apply,
+or the daily drift `plan`. This was the single largest cause of infra CI
+failures. These tests pin that tg injects -lock-timeout for state-locking
+verbs (and still preserves -auto-approve for non-interactive applies), so a
+contended lock waits rather than fails.
+
+Hermetic: a stub `terragrunt` on PATH records the args tg forwards; PG_CONN_STR
+is pre-set so the Tier-1 Vault credential fetch is skipped (no network/Vault).
+"""
+import os
+import shutil
+import subprocess
+from pathlib import Path
+
+import pytest
+
+SCRIPTS_DIR = Path(__file__).resolve().parent
+TG = SCRIPTS_DIR / "tg"
+AUTH_CHECK = SCRIPTS_DIR / "check-ingress-auth-comments.py"
+
+
+def _run(tmp_path, *tg_args, env_extra=None):
+    """Run a copy of scripts/tg in an isolated fake repo; return forwarded args."""
+    repo = tmp_path / "repo"
+    (repo / "scripts").mkdir(parents=True)
+    shutil.copy(TG, repo / "scripts" / "tg")
+    shutil.copy(AUTH_CHECK, repo / "scripts" / "check-ingress-auth-comments.py")
+    os.chmod(repo / "scripts" / "tg", 0o755)
+    os.chmod(repo / "scripts" / "check-ingress-auth-comments.py", 0o755)
+
+    # Fake Tier-1 stack ("faketest" is NOT in TIER0_STACKS), no ingress auth lines.
+    stack = repo / "stacks" / "faketest"
+    stack.mkdir(parents=True)
+    (stack / "terragrunt.hcl").write_text("# fake\n")
+    (stack / "main.tf").write_text("# no ingress_factory auth lines here\n")
+
+    # Stub terragrunt: append every forwarded arg (one per line) to a capture file.
+    bindir = tmp_path / "bin"
+    bindir.mkdir()
+    capture = tmp_path / "tg_args.txt"
+    stub = bindir / "terragrunt"
+    stub.write_text(
+        "#!/usr/bin/env bash\n"
+        f'for a in "$@"; do echo "$a" >> "{capture}"; done\n'
+        "exit 0\n"
+    )
+    os.chmod(stub, 0o755)
+
+    env = dict(os.environ)
+    env["PATH"] = f"{bindir}:{env['PATH']}"
+    env["PG_CONN_STR"] = "postgres://stub"  # skip the Tier-1 Vault cred fetch
+    env["TF_PLUGIN_CACHE_DIR"] = str(tmp_path / "plugin-cache")
+    if env_extra:
+        env.update(env_extra)
+
+    proc = subprocess.run(
+        ["bash", str(repo / "scripts" / "tg"), *tg_args],
+        cwd=str(stack),
+        env=env,
+        capture_output=True,
+        text=True,
+    )
+    assert proc.returncode == 0, f"tg exited {proc.returncode}\nSTDERR:\n{proc.stderr}\nSTDOUT:\n{proc.stdout}"
+    return capture.read_text().splitlines() if capture.exists() else []
+
+
+def test_apply_non_interactive_has_lock_timeout_and_auto_approve(tmp_path):
+    args = _run(tmp_path, "apply", "--non-interactive")
+    assert "apply" in args
+    assert "-auto-approve" in args, "non-interactive apply must keep -auto-approve"
+    assert "-lock-timeout=5m" in args, "apply must wait for a contended state lock"
+
+
+def test_plan_has_lock_timeout_but_not_auto_approve(tmp_path):
+    args = _run(tmp_path, "plan")
+    assert "plan" in args
+    assert "-lock-timeout=5m" in args
+    assert "-auto-approve" not in args, "plan must never get -auto-approve"
+
+
+@pytest.mark.parametrize("verb", ["destroy", "refresh"])
+def test_locking_verb_gets_lock_timeout(tmp_path, verb):
+    args = _run(tmp_path, verb)
+    assert "-lock-timeout=5m" in args, f"{verb} should carry -lock-timeout"
+
+
+def test_non_locking_verb_has_no_lock_timeout(tmp_path):
+    # validate does not take a state lock — must not carry -lock-timeout.
+    args = _run(tmp_path, "validate")
+    assert "validate" in args
+    assert not any(a.startswith("-lock-timeout") for a in args)
+
+
+def test_lock_timeout_is_env_overridable(tmp_path):
+    args = _run(tmp_path, "plan", env_extra={"TG_LOCK_TIMEOUT": "2m"})
+    assert "-lock-timeout=2m" in args
--- a/scripts/tg
+++ b/scripts/tg
@ -13,6 +13,15 @@ export TF_PLUGIN_CACHE_DIR="${TF_PLUGIN_CACHE_DIR:-$HOME/.terraform.d/plugin-cac
 export TF_PLUGIN_CACHE_MAY_BREAK_DEPENDENCY_LOCK_FILE=1
 mkdir -p "$TF_PLUGIN_CACHE_DIR"

+# State-lock wait window. Tier-1 stacks lock their state via terraform's pg
+# backend (pg_advisory_lock); with no timeout an apply fails instantly
+# ("Error acquiring the state lock") the moment anything else holds the lock —
+# a Woodpecker-killed run whose lock PG hasn't reaped yet, a concurrent local
+# apply, or the daily drift `plan`. Waiting a few minutes absorbs all of those
+# (the holder finishes, or PG reaps the dead backend). This was the #1 cause of
+# infra CI failures. Override with TG_LOCK_TIMEOUT (e.g. 0 to fail fast).
+LOCK_TIMEOUT="${TG_LOCK_TIMEOUT:-5m}"
+
 # Determine stack name from cwd (relative to stacks/)
 STACK_NAME=""
 cwd="$(pwd)"
@ -134,29 +143,30 @@ if $is_mutating && [ -n "$STACK_NAME" ] && is_tier0 "$STACK_NAME"; then
  fi
 fi

-# If running apply with --non-interactive, add -auto-approve for Terraform
+# Build the terragrunt invocation:
+#  - add -auto-approve right after `apply` for --non-interactive runs (CI)
+#  - add -lock-timeout for state-locking verbs (plan/apply/destroy/refresh) so
+#    a contended state lock WAITS instead of failing instantly (see
+#    LOCK_TIMEOUT above). Non-locking verbs (init/validate/output/fmt) skip it.
 args=("$@")
-has_apply=false
 has_non_interactive=false
 for arg in "${args[@]}"; do
  case "$arg" in
-    apply) has_apply=true ;;
    --non-interactive) has_non_interactive=true ;;
  esac
 done

-if $has_apply && $has_non_interactive; then
-  new_args=()
-  for arg in "${args[@]}"; do
-    new_args+=("$arg")
-    if [ "$arg" = "apply" ]; then
-      new_args+=("-auto-approve")
-    fi
-  done
-  terragrunt "${new_args[@]}"
-else
-  terragrunt "$@"
+tg_args=()
+for arg in "${args[@]}"; do
+  tg_args+=("$arg")
+  if [ "$arg" = "apply" ] && $has_non_interactive; then
+    tg_args+=("-auto-approve")
+  fi
+done
+if $is_tf_op; then
+  tg_args+=("-lock-timeout=$LOCK_TIMEOUT")
 fi
+terragrunt "${tg_args[@]}"

 # After mutating operations: encrypt+commit (Tier 0) or no-op (Tier 1 — PG is authoritative)
 if $is_mutating && [ -n "$STACK_NAME" ] && is_tier0 "$STACK_NAME"; then