ci: scripts/tg waits out a contended state lock (-lock-timeout)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The infra CI pipeline was failing often — ~38% of the last 50 runs didn't succeed. The single biggest cause (8 of 19 non-successes) was Tier-1 stack applies dying instantly with "Error acquiring the state lock". Tier-0 stacks already degrade gracefully (Vault advisory lock → the pipeline skips a locked stack). Tier-1 stacks have no such fallback: they rely on terraform's pg-backend pg_advisory_lock, and scripts/tg ran terragrunt with no -lock-timeout, so any concurrent lock holder was fatal — a Woodpecker-killed run whose PG lock wasn't reaped yet (PL266 killed → PL267 failed the same second), a human/agent applying locally, or the daily drift `plan`. Fix: scripts/tg now passes -lock-timeout (default 5m, override TG_LOCK_TIMEOUT) on every state-locking verb (plan/apply/destroy/refresh), so a contended lock WAITS for the holder to finish instead of failing. -auto-approve behaviour for non-interactive applies is unchanged. Central wrapper change → covers CI, plus local human/agent applies; no CI image rebuild (tg is read from the repo). Adds a hermetic pytest (stub terragrunt + preset PG_CONN_STR) pinning the arg-injection. Docs updated in AGENTS.md + .claude/CLAUDE.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
parent
9774ae3d19
commit
7bd4612edf
4 changed files with 129 additions and 17 deletions
|
|
@ -25,7 +25,7 @@ Violations cause state drift, which causes future applies to break or silently r
|
||||||
|
|
||||||
## Instructions
|
## Instructions
|
||||||
- **"remember X"**: Use `memory-tool store "content" --category facts --tags "tag1,tag2"` (via exec) for persistent cross-session memory. Also update this file + `AGENTS.md` (if shared knowledge), commit with `[ci skip]`. To recall: `memory-tool recall "query"`. To list: `memory-tool list`. To delete: `memory-tool delete <id>`. The native `memory_search` and `memory_get` tools are also available for searching indexed memory files. For **storing** new memories, always use the `memory-tool` CLI via exec.
|
- **"remember X"**: Use `memory-tool store "content" --category facts --tags "tag1,tag2"` (via exec) for persistent cross-session memory. Also update this file + `AGENTS.md` (if shared knowledge), commit with `[ci skip]`. To recall: `memory-tool recall "query"`. To list: `memory-tool list`. To delete: `memory-tool delete <id>`. The native `memory_search` and `memory_get` tools are also available for searching indexed memory files. For **storing** new memories, always use the `memory-tool` CLI via exec.
|
||||||
- **Apply**: Authenticate via `vault login -method=oidc`, then use `scripts/tg` (preferred — handles state decrypt/encrypt) or `terragrunt` directly. `scripts/tg` adds `-auto-approve` for `--non-interactive` applies.
|
- **Apply**: Authenticate via `vault login -method=oidc`, then use `scripts/tg` (preferred — handles state decrypt/encrypt) or `terragrunt` directly. `scripts/tg` adds `-auto-approve` for `--non-interactive` applies, and `-lock-timeout` (default `5m`, override via `TG_LOCK_TIMEOUT`) on every state-locking verb (`plan`/`apply`/`destroy`/`refresh`) so a contended state lock **waits** instead of failing instantly with `Error acquiring the state lock`.
|
||||||
- **New services need CI/CD** and **monitoring** (Prometheus/Uptime Kuma). CI = a GHA workflow on the repo's GitHub mirror (build + tests off-infra, ADR-0002); Woodpecker gets a deploy-only pipeline — never an in-cluster build.
|
- **New services need CI/CD** and **monitoring** (Prometheus/Uptime Kuma). CI = a GHA workflow on the repo's GitHub mirror (build + tests off-infra, ADR-0002); Woodpecker gets a deploy-only pipeline — never an in-cluster build.
|
||||||
- **New service**: Use `setup-project` skill for full workflow
|
- **New service**: Use `setup-project` skill for full workflow
|
||||||
- **Ingress**: `ingress_factory` module. **Auth** (`auth` string enum, default `"required"` — fail-closed). Pick by asking "what gates the app?":
|
- **Ingress**: `ingress_factory` module. **Auth** (`auth` string enum, default `"required"` — fail-closed). Pick by asking "what gates the app?":
|
||||||
|
|
@ -47,7 +47,7 @@ Violations cause state drift, which causes future applies to break or silently r
|
||||||
|
|
||||||
## Terraform State — Two-Tier Backend
|
## Terraform State — Two-Tier Backend
|
||||||
- **Tier 0 (bootstrap)**: Local state, SOPS-encrypted in git. Stacks: `infra`, `platform`, `cnpg`, `vault`, `dbaas`, `external-secrets`. These must exist before PG is reachable.
|
- **Tier 0 (bootstrap)**: Local state, SOPS-encrypted in git. Stacks: `infra`, `platform`, `cnpg`, `vault`, `dbaas`, `external-secrets`. These must exist before PG is reachable.
|
||||||
- **Tier 1 (everything else)**: PostgreSQL backend (`pg`) on CNPG cluster at `pg-cluster-rw.dbaas.svc.cluster.local:5432/terraform_state`. Native `pg_advisory_lock` for concurrent safety. Each stack gets its own PG schema.
|
- **Tier 1 (everything else)**: PostgreSQL backend (`pg`) on CNPG cluster at `pg-cluster-rw.dbaas.svc.cluster.local:5432/terraform_state`. Native `pg_advisory_lock` for concurrent safety. Each stack gets its own PG schema. **Lock contention is non-fatal**: `scripts/tg` passes `-lock-timeout` (default `5m`) so a contended lock waits rather than hard-failing — this was the #1 cause of infra CI failures (a Woodpecker-killed run's unreaped PG lock, a concurrent local apply, or the daily drift `plan`; Tier-1 stacks have no Vault advisory-lock skip to fall back on, unlike Tier-0).
|
||||||
- **Auth**: `scripts/tg` auto-fetches PG credentials from Vault (`database/static-creds/pg-terraform-state`). Humans use `vault login -method=oidc`, agents use K8s auth (role: `terraform-state`, namespace: `claude-agent`).
|
- **Auth**: `scripts/tg` auto-fetches PG credentials from Vault (`database/static-creds/pg-terraform-state`). Humans use `vault login -method=oidc`, agents use K8s auth (role: `terraform-state`, namespace: `claude-agent`).
|
||||||
- **Tier 0 workflow** (unchanged): `git pull` → `scripts/tg plan` → `scripts/tg apply` → `git push`. State sync via SOPS is transparent.
|
- **Tier 0 workflow** (unchanged): `git pull` → `scripts/tg plan` → `scripts/tg apply` → `git push`. State sync via SOPS is transparent.
|
||||||
- **Tier 1 workflow**: `vault login -method=oidc` → `scripts/tg plan` → `scripts/tg apply`. No git commit needed — PG is authoritative.
|
- **Tier 1 workflow**: `vault login -method=oidc` → `scripts/tg plan` → `scripts/tg apply`. No git commit needed — PG is authoritative.
|
||||||
|
|
|
||||||
|
|
@ -9,7 +9,7 @@
|
||||||
- **Ask before `git push`** — always confirm with the user first
|
- **Ask before `git push`** — always confirm with the user first
|
||||||
|
|
||||||
## Execution
|
## Execution
|
||||||
- **Apply a service**: `scripts/tg apply --non-interactive` (auto-decrypts SOPS secrets)
|
- **Apply a service**: `scripts/tg apply --non-interactive` (auto-decrypts SOPS secrets; passes `-lock-timeout`, default `5m` / `TG_LOCK_TIMEOUT`, so a contended state lock waits instead of failing with `Error acquiring the state lock`)
|
||||||
- **Legacy apply**: `cd stacks/<service> && terragrunt apply --non-interactive` (uses terraform.tfvars)
|
- **Legacy apply**: `cd stacks/<service> && terragrunt apply --non-interactive` (uses terraform.tfvars)
|
||||||
- **kubectl**: `kubectl --kubeconfig $(pwd)/config`
|
- **kubectl**: `kubectl --kubeconfig $(pwd)/config`
|
||||||
- **Health check**: `bash scripts/cluster_healthcheck.sh --quiet`
|
- **Health check**: `bash scripts/cluster_healthcheck.sh --quiet`
|
||||||
|
|
|
||||||
102
scripts/test_tg_lock_timeout.py
Normal file
102
scripts/test_tg_lock_timeout.py
Normal file
|
|
@ -0,0 +1,102 @@
|
||||||
|
#!/usr/bin/env python3
|
||||||
|
"""Tests for scripts/tg lock-timeout injection.
|
||||||
|
|
||||||
|
scripts/tg wraps terragrunt. Tier-1 stacks rely on terraform's pg-backend
|
||||||
|
state lock; without -lock-timeout an apply fails instantly ("Error acquiring
|
||||||
|
the state lock") whenever anything else holds the lock — a Woodpecker-killed
|
||||||
|
run whose PG advisory lock has not been reaped yet, a concurrent local apply,
|
||||||
|
or the daily drift `plan`. This was the single largest cause of infra CI
|
||||||
|
failures. These tests pin that tg injects -lock-timeout for state-locking
|
||||||
|
verbs (and still preserves -auto-approve for non-interactive applies), so a
|
||||||
|
contended lock waits rather than fails.
|
||||||
|
|
||||||
|
Hermetic: a stub `terragrunt` on PATH records the args tg forwards; PG_CONN_STR
|
||||||
|
is pre-set so the Tier-1 Vault credential fetch is skipped (no network/Vault).
|
||||||
|
"""
|
||||||
|
import os
|
||||||
|
import shutil
|
||||||
|
import subprocess
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
import pytest
|
||||||
|
|
||||||
|
SCRIPTS_DIR = Path(__file__).resolve().parent
|
||||||
|
TG = SCRIPTS_DIR / "tg"
|
||||||
|
AUTH_CHECK = SCRIPTS_DIR / "check-ingress-auth-comments.py"
|
||||||
|
|
||||||
|
|
||||||
|
def _run(tmp_path, *tg_args, env_extra=None):
|
||||||
|
"""Run a copy of scripts/tg in an isolated fake repo; return forwarded args."""
|
||||||
|
repo = tmp_path / "repo"
|
||||||
|
(repo / "scripts").mkdir(parents=True)
|
||||||
|
shutil.copy(TG, repo / "scripts" / "tg")
|
||||||
|
shutil.copy(AUTH_CHECK, repo / "scripts" / "check-ingress-auth-comments.py")
|
||||||
|
os.chmod(repo / "scripts" / "tg", 0o755)
|
||||||
|
os.chmod(repo / "scripts" / "check-ingress-auth-comments.py", 0o755)
|
||||||
|
|
||||||
|
# Fake Tier-1 stack ("faketest" is NOT in TIER0_STACKS), no ingress auth lines.
|
||||||
|
stack = repo / "stacks" / "faketest"
|
||||||
|
stack.mkdir(parents=True)
|
||||||
|
(stack / "terragrunt.hcl").write_text("# fake\n")
|
||||||
|
(stack / "main.tf").write_text("# no ingress_factory auth lines here\n")
|
||||||
|
|
||||||
|
# Stub terragrunt: append every forwarded arg (one per line) to a capture file.
|
||||||
|
bindir = tmp_path / "bin"
|
||||||
|
bindir.mkdir()
|
||||||
|
capture = tmp_path / "tg_args.txt"
|
||||||
|
stub = bindir / "terragrunt"
|
||||||
|
stub.write_text(
|
||||||
|
"#!/usr/bin/env bash\n"
|
||||||
|
f'for a in "$@"; do echo "$a" >> "{capture}"; done\n'
|
||||||
|
"exit 0\n"
|
||||||
|
)
|
||||||
|
os.chmod(stub, 0o755)
|
||||||
|
|
||||||
|
env = dict(os.environ)
|
||||||
|
env["PATH"] = f"{bindir}:{env['PATH']}"
|
||||||
|
env["PG_CONN_STR"] = "postgres://stub" # skip the Tier-1 Vault cred fetch
|
||||||
|
env["TF_PLUGIN_CACHE_DIR"] = str(tmp_path / "plugin-cache")
|
||||||
|
if env_extra:
|
||||||
|
env.update(env_extra)
|
||||||
|
|
||||||
|
proc = subprocess.run(
|
||||||
|
["bash", str(repo / "scripts" / "tg"), *tg_args],
|
||||||
|
cwd=str(stack),
|
||||||
|
env=env,
|
||||||
|
capture_output=True,
|
||||||
|
text=True,
|
||||||
|
)
|
||||||
|
assert proc.returncode == 0, f"tg exited {proc.returncode}\nSTDERR:\n{proc.stderr}\nSTDOUT:\n{proc.stdout}"
|
||||||
|
return capture.read_text().splitlines() if capture.exists() else []
|
||||||
|
|
||||||
|
|
||||||
|
def test_apply_non_interactive_has_lock_timeout_and_auto_approve(tmp_path):
|
||||||
|
args = _run(tmp_path, "apply", "--non-interactive")
|
||||||
|
assert "apply" in args
|
||||||
|
assert "-auto-approve" in args, "non-interactive apply must keep -auto-approve"
|
||||||
|
assert "-lock-timeout=5m" in args, "apply must wait for a contended state lock"
|
||||||
|
|
||||||
|
|
||||||
|
def test_plan_has_lock_timeout_but_not_auto_approve(tmp_path):
|
||||||
|
args = _run(tmp_path, "plan")
|
||||||
|
assert "plan" in args
|
||||||
|
assert "-lock-timeout=5m" in args
|
||||||
|
assert "-auto-approve" not in args, "plan must never get -auto-approve"
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize("verb", ["destroy", "refresh"])
|
||||||
|
def test_locking_verb_gets_lock_timeout(tmp_path, verb):
|
||||||
|
args = _run(tmp_path, verb)
|
||||||
|
assert "-lock-timeout=5m" in args, f"{verb} should carry -lock-timeout"
|
||||||
|
|
||||||
|
|
||||||
|
def test_non_locking_verb_has_no_lock_timeout(tmp_path):
|
||||||
|
# validate does not take a state lock — must not carry -lock-timeout.
|
||||||
|
args = _run(tmp_path, "validate")
|
||||||
|
assert "validate" in args
|
||||||
|
assert not any(a.startswith("-lock-timeout") for a in args)
|
||||||
|
|
||||||
|
|
||||||
|
def test_lock_timeout_is_env_overridable(tmp_path):
|
||||||
|
args = _run(tmp_path, "plan", env_extra={"TG_LOCK_TIMEOUT": "2m"})
|
||||||
|
assert "-lock-timeout=2m" in args
|
||||||
38
scripts/tg
38
scripts/tg
|
|
@ -13,6 +13,15 @@ export TF_PLUGIN_CACHE_DIR="${TF_PLUGIN_CACHE_DIR:-$HOME/.terraform.d/plugin-cac
|
||||||
export TF_PLUGIN_CACHE_MAY_BREAK_DEPENDENCY_LOCK_FILE=1
|
export TF_PLUGIN_CACHE_MAY_BREAK_DEPENDENCY_LOCK_FILE=1
|
||||||
mkdir -p "$TF_PLUGIN_CACHE_DIR"
|
mkdir -p "$TF_PLUGIN_CACHE_DIR"
|
||||||
|
|
||||||
|
# State-lock wait window. Tier-1 stacks lock their state via terraform's pg
|
||||||
|
# backend (pg_advisory_lock); with no timeout an apply fails instantly
|
||||||
|
# ("Error acquiring the state lock") the moment anything else holds the lock —
|
||||||
|
# a Woodpecker-killed run whose lock PG hasn't reaped yet, a concurrent local
|
||||||
|
# apply, or the daily drift `plan`. Waiting a few minutes absorbs all of those
|
||||||
|
# (the holder finishes, or PG reaps the dead backend). This was the #1 cause of
|
||||||
|
# infra CI failures. Override with TG_LOCK_TIMEOUT (e.g. 0 to fail fast).
|
||||||
|
LOCK_TIMEOUT="${TG_LOCK_TIMEOUT:-5m}"
|
||||||
|
|
||||||
# Determine stack name from cwd (relative to stacks/)
|
# Determine stack name from cwd (relative to stacks/)
|
||||||
STACK_NAME=""
|
STACK_NAME=""
|
||||||
cwd="$(pwd)"
|
cwd="$(pwd)"
|
||||||
|
|
@ -134,29 +143,30 @@ if $is_mutating && [ -n "$STACK_NAME" ] && is_tier0 "$STACK_NAME"; then
|
||||||
fi
|
fi
|
||||||
fi
|
fi
|
||||||
|
|
||||||
# If running apply with --non-interactive, add -auto-approve for Terraform
|
# Build the terragrunt invocation:
|
||||||
|
# - add -auto-approve right after `apply` for --non-interactive runs (CI)
|
||||||
|
# - add -lock-timeout for state-locking verbs (plan/apply/destroy/refresh) so
|
||||||
|
# a contended state lock WAITS instead of failing instantly (see
|
||||||
|
# LOCK_TIMEOUT above). Non-locking verbs (init/validate/output/fmt) skip it.
|
||||||
args=("$@")
|
args=("$@")
|
||||||
has_apply=false
|
|
||||||
has_non_interactive=false
|
has_non_interactive=false
|
||||||
for arg in "${args[@]}"; do
|
for arg in "${args[@]}"; do
|
||||||
case "$arg" in
|
case "$arg" in
|
||||||
apply) has_apply=true ;;
|
|
||||||
--non-interactive) has_non_interactive=true ;;
|
--non-interactive) has_non_interactive=true ;;
|
||||||
esac
|
esac
|
||||||
done
|
done
|
||||||
|
|
||||||
if $has_apply && $has_non_interactive; then
|
tg_args=()
|
||||||
new_args=()
|
for arg in "${args[@]}"; do
|
||||||
for arg in "${args[@]}"; do
|
tg_args+=("$arg")
|
||||||
new_args+=("$arg")
|
if [ "$arg" = "apply" ] && $has_non_interactive; then
|
||||||
if [ "$arg" = "apply" ]; then
|
tg_args+=("-auto-approve")
|
||||||
new_args+=("-auto-approve")
|
fi
|
||||||
fi
|
done
|
||||||
done
|
if $is_tf_op; then
|
||||||
terragrunt "${new_args[@]}"
|
tg_args+=("-lock-timeout=$LOCK_TIMEOUT")
|
||||||
else
|
|
||||||
terragrunt "$@"
|
|
||||||
fi
|
fi
|
||||||
|
terragrunt "${tg_args[@]}"
|
||||||
|
|
||||||
# After mutating operations: encrypt+commit (Tier 0) or no-op (Tier 1 — PG is authoritative)
|
# After mutating operations: encrypt+commit (Tier 0) or no-op (Tier 1 — PG is authoritative)
|
||||||
if $is_mutating && [ -n "$STACK_NAME" ] && is_tier0 "$STACK_NAME"; then
|
if $is_mutating && [ -n "$STACK_NAME" ] && is_tier0 "$STACK_NAME"; then
|
||||||
|
|
|
||||||
Loading…
Add table
Add a link
Reference in a new issue