The infra terragrunt-apply pipeline (.woodpecker/default.yml) was going
red ~20% of the time. Root causes (verified from the failure logs, not
guessed):
1. infra is registered in Woodpecker TWICE — canonical Forgejo (repo 82)
AND legacy GitHub mirror (repo 1) — and BOTH run `default.yml` on every
push. The two applies race each other for the per-stack PG state lock →
"Error acquiring the state lock" failures + push-supersede "killed" runs.
2. The skip-not-fail lock guard only matched the Tier-0 Vault lock string
("is locked by"); the Tier-1 PG-backend lock ("Error acquiring the state
lock") fell through and was counted as a hard FAILURE.
3. Transient provider-registry download timeouts (and Vault 5xx) failed the
whole pipeline with no retry.
Fixes (all in default.yml):
- Forge guard: the push-apply runs ONLY on the canonical Forgejo forge; on
the GitHub mirror it no-ops (exit 0). The mirror keeps running the crons
(they live on repo 1), so we de-dup the apply without deactivating the
registration. Fail-open on unknown forge.
- Lock-skip now matches BOTH tiers (Vault + PG) → lock-waits are SKIPPED.
- Bounded retry (3x) ONLY on transient signatures (provider download
timeout, Vault 5xx). Config errors + helm atomic-timeouts fail fast.
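The forge guard can be sketched as a standalone shell function. This is only an illustration: `is_github_mirror` and `forge_decision` are hypothetical names (default.yml inlines the check), and the Forgejo URL is a placeholder.

```shell
#!/bin/sh
# Hypothetical helper (not in default.yml, which inlines this check):
# succeeds when either URL points at github.com, i.e. the legacy mirror.
is_github_mirror() {
  printf '%s\n' "${1:-}${2:-}" | grep -qi 'github\.com'
}

# Fail-open: when both values are empty nothing matches, so the apply runs.
forge_decision() {
  if is_github_mirror "${1:-}" "${2:-}"; then
    echo skip
  else
    echo apply
  fi
}

forge_decision 'https://github.com/ViktorBarzin/infra' ''      # skip
forge_decision 'https://forgejo.example.invalid/org/infra' ''  # apply (placeholder URL)
forge_decision '' ''                                           # apply (unknown forge)
```

In the pipeline the two arguments would be `$CI_REPO_URL` and `$CI_FORGE_URL`, and a "skip" decision corresponds to `exit 0`.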
Rejected (documented in docs/architecture/ci-cd.md): an off-infra GHA
validate gate (catches ~0 of the real, runtime/Vault-data/SSA/lock
failures; reproduced `terraform validate` passing the exact stacks that
fail at apply) and lock-reaping/force-unlock (PG advisory locks are
session-scoped + auto-release; force-unlock can't free them and would
corrupt a live concurrent apply).
Shell logic + the classification regexes were unit-tested locally against
the real decoded error strings (#359 PG lock, #353 provider timeout, #360
missing-arg, helm atomic timeout); `bash -n` clean; YAML parses.
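A local check of the classification regexes might look like the sketch below. `classify_apply_output` is a hypothetical wrapper around the same patterns default.yml uses, and the sample strings are illustrative stand-ins, not the real decoded logs.

```shell
#!/bin/sh
# Hypothetical helper: classify captured `tg apply` output.
# Prints one of: lock (skip the stack), transient (retry), fail (fail fast).
classify_apply_output() {
  output=$1
  if printf '%s\n' "$output" | grep -qE 'is locked by|Error acquiring the state lock|already locked'; then
    echo lock
  elif printf '%s\n' "$output" | grep -qE 'Failed to install provider|Client\.Timeout exceeded while awaiting headers|error reading from Vault.*Code: 5[0-9][0-9]'; then
    echo transient
  else
    echo fail
  fi
}

# Illustrative inputs only; substitute the real error strings when testing.
classify_apply_output 'Error: Error acquiring the state lock'   # lock
classify_apply_output 'Error: Failed to install provider'       # transient
classify_apply_output 'Error: Missing required argument'        # fail
```

Lock contention is checked first, so output that contains both a lock message and a transient signature is skipped and not retried, matching the order of the checks in the apply loops.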
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
347 lines
17 KiB
YAML
# Unified infra CI pipeline — detects changed stacks and applies only those.
# Platform stacks and app stacks handled in one pipeline with proper ordering.
#
# Optimizations over the previous split pipeline:
# - Custom CI image (no apk/wget per step)
# - Shallow clone (depth=2 for git diff HEAD~1)
# - TF_PLUGIN_CACHE_DIR (shared provider cache)
# - Serial apply with Vault advisory locks (prevents user/CI race conditions)
# - Step consolidation (2 steps instead of 4)
# - Changed-stacks-only detection (skips no-op applies)
# - Global-file fallback (modules/config changes trigger full apply)
# - Lock-aware: skips stacks locked by users instead of failing

when:
  event: push
  branch: master

clone:
  git:
    image: woodpeckerci/plugin-git
    settings:
      partial: false
      depth: 2
      attempts: 5
      backoff: 10s

steps:
  # Audit feed for the allow-then-audit contribution model: any master push by
  # a NON-admin author is surfaced in Slack (Viktor's own pushes are not).
  # Runs before apply and never blocks it. Note: [ci skip] commits never reach
  # this step (Woodpecker skips the whole pipeline) — hence the rule that
  # non-admins must not use [ci skip].
  - name: notify-nonadmin-push
    image: curlimages/curl
    environment:
      SLACK_WEBHOOK:
        from_secret: slack_webhook
    commands:
      - |
        case "$CI_COMMIT_AUTHOR" in
          viktor|ViktorBarzin|wizard) echo "admin push — no notify"; exit 0 ;;
        esac
        SUBJECT=$(echo "$CI_COMMIT_MESSAGE" | head -1 | tr -d '"\\')
        curl -s -X POST -H 'Content-type: application/json' \
          --data "{\"text\":\"📝 infra master push by *$CI_COMMIT_AUTHOR*: $SUBJECT\n$CI_REPO_URL/commit/$CI_COMMIT_SHA\"}" \
          "$SLACK_WEBHOOK" || true

  - name: apply
    image: ghcr.io/viktorbarzin/infra-ci:latest
    pull: true
    backend_options:
      kubernetes:
        resources:
          requests:
            memory: 3Gi
          limits:
            memory: 6Gi
    environment:
      SLACK_WEBHOOK:
        from_secret: slack_webhook
      # Each `- |` command runs in a fresh shell, so we can't rely on an
      # `export VAULT_ADDR=...` in the auth command persisting — pin it at
      # step level. VAULT_TOKEN is still per-command; we persist it to
      # ~/.vault-token (auto-read by `vault` CLI) so downstream commands
      # don't need explicit token propagation.
      VAULT_ADDR: http://vault-active.vault.svc.cluster.local:8200
    commands:
      # ── Forge guard: apply ONLY on the canonical Forgejo forge ──
      # infra is registered in Woodpecker on BOTH the Forgejo canonical repo and
      # the legacy GitHub mirror, and BOTH fire this push pipeline. Without this
      # guard both run `terragrunt apply` on every push and race each other for
      # the per-stack PG state lock — the dominant cause of the "Error acquiring
      # the state lock" failures + push-supersede "killed" runs. The GitHub-mirror
      # registration keeps running the CRONS (drift-detection, renew-tls, …) — only
      # its duplicate push-apply no-ops here. Fail-open: an unknown forge (neither
      # env var set) still applies, preserving prior behaviour.
      - |
        if echo "${CI_REPO_URL:-}${CI_FORGE_URL:-}" | grep -qi 'github\.com'; then
          echo "[forge-guard] GitHub-mirror push — apply runs only on the Forgejo canonical repo (avoids double-apply + state-lock races). Skipping."
          exit 0
        fi

      # ── Skip CI commits ──
      - |
        if echo "$CI_COMMIT_MESSAGE" | grep -q '\[CI SKIP\]\|\[ci skip\]'; then
          echo "Commit has [CI SKIP], exiting"
          exit 0
        fi

      # ── git-crypt unlock ──
      - |
        SA_TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
        curl -sk "https://10.0.20.100:6443/api/v1/namespaces/woodpecker/configmaps/git-crypt-key" \
          -H "Authorization:Bearer $SA_TOKEN" | jq -r .data.key | base64 -d > /tmp/key
        git-crypt unlock /tmp/key && rm /tmp/key

      # ── Vault auth ──
      - |
        SA_TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
        VAULT_TOKEN=$(curl -s -X POST "$VAULT_ADDR/v1/auth/kubernetes/login" \
          -d "{\"role\":\"ci\",\"jwt\":\"$SA_TOKEN\"}" | jq -r .auth.client_token)
        if [ -z "$VAULT_TOKEN" ] || [ "$VAULT_TOKEN" = "null" ]; then
          echo "ERROR: Vault K8s auth failed (role=ci, ns=woodpecker)" >&2
          exit 1
        fi
        # Persist for downstream `- |` blocks (each runs in a fresh shell,
        # so exporting VAULT_TOKEN wouldn't help). `vault`, `scripts/tg`,
        # and `scripts/state-sync` all fall through to ~/.vault-token when
        # the env var is unset.
        umask 077; printf '%s' "$VAULT_TOKEN" > "$HOME/.vault-token"

      # ── Generate kubeconfig from projected SA token ──
      # terragrunt.hcl injects `-var kube_config_path=<repo>/config` for every
      # terraform invocation, so we need a kubeconfig file at that path. The
      # `default` SA in the woodpecker namespace is cluster-admin (via the
      # `woodpecker-default` ClusterRoleBinding), so the projected token is
      # sufficient to apply any stack. Using `tokenFile` (not an inline token)
      # so the provider re-reads it if kubelet rotates the projected token
      # mid-pipeline.
      - |
        cat > config <<'EOF'
        apiVersion: v1
        kind: Config
        clusters:
        - name: kubernetes
          cluster:
            server: https://10.0.20.100:6443
            certificate-authority: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        contexts:
        - name: ci
          context:
            cluster: kubernetes
            user: ci
        current-context: ci
        users:
        - name: ci
          user:
            tokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
        EOF
        chmod 600 config
        # Sanity check: kubeconfig works
        kubectl --kubeconfig=config get ns kube-system -o name >/dev/null

      # ── Detect changed stacks ──
      - |
        PLATFORM_STACKS="dbaas authentik crowdsec monitoring nvidia mailserver cloudflared kyverno metallb redis traefik technitium headscale rbac k8s-portal vaultwarden reverse-proxy metrics-server vpa nfs-csi iscsi-csi cnpg sealed-secrets uptime-kuma wireguard xray infra-maintenance platform vault reloader descheduler external-secrets"

        # Ensure we have enough history for diff (clone may be shallow)
        if ! git rev-parse HEAD~1 >/dev/null 2>&1; then
          echo "WARNING: HEAD~1 not available (shallow clone?) — fetching more history"
          git fetch --deepen=1 origin master 2>/dev/null || true
        fi

        # Diff base: prefer the push's true before-state (CI_PREV_COMMIT_SHA).
        # HEAD~1 is WRONG for merge commits — it is the first parent (the
        # feature-branch side), so the diff shows the OTHER lineage's files
        # and silently skips the stacks this push actually changed
        # (bit ci-pipeline-health on 2026-06-12, pipeline 128).
        DIFF_BASE="HEAD~1"
        if [ -n "${CI_PREV_COMMIT_SHA:-}" ] && [ "$CI_PREV_COMMIT_SHA" != "$CI_COMMIT_SHA" ]; then
          git cat-file -e "$CI_PREV_COMMIT_SHA^{commit}" 2>/dev/null || git fetch --depth=50 origin master 2>/dev/null || true
          # Restarted pipelines after master moved produce REVERSE diffs
          # (CI_PREV ahead of the checked-out HEAD re-applied stale trees and
          # reverted a sibling apply on 2026-06-12, pipeline 148). Only use
          # CI_PREV when it is an ancestor of HEAD.
          if git cat-file -e "$CI_PREV_COMMIT_SHA^{commit}" 2>/dev/null \
            && git merge-base --is-ancestor "$CI_PREV_COMMIT_SHA" HEAD 2>/dev/null; then
            DIFF_BASE="$CI_PREV_COMMIT_SHA"
          fi
        fi
        echo "Diff base: $DIFF_BASE"

        # If still no parent, apply all platform stacks as a safe fallback
        if ! git rev-parse HEAD~1 >/dev/null 2>&1; then
          echo "Cannot determine changed files — applying ALL platform stacks"
          echo "$PLATFORM_STACKS" | tr ' ' '\n' > .platform_apply
          > .app_apply
        else
          # Check if global files changed (triggers full platform apply)
          GLOBAL_CHANGED=$(git diff --name-only "$DIFF_BASE" HEAD | grep -E '^(modules/|config\.tfvars|terragrunt\.hcl)' || true)

          if [ -n "$GLOBAL_CHANGED" ]; then
            echo "Global files changed — applying ALL platform stacks"
            echo "$PLATFORM_STACKS" | tr ' ' '\n' > .platform_apply
          else
            # Detect platform stacks that changed
            git diff --name-only "$DIFF_BASE" HEAD | grep '^stacks/' | cut -d/ -f2 | sort -u > .all_changed
            > .platform_apply
            while read -r stack; do
              if echo "$PLATFORM_STACKS" | grep -qw "$stack"; then
                echo "$stack" >> .platform_apply
              fi
            done < .all_changed
          fi

          # Detect app stacks that changed
          > .app_apply
          git diff --name-only "$DIFF_BASE" HEAD | grep '^stacks/' | cut -d/ -f2 | sort -u | while read -r stack; do
            if echo "$PLATFORM_STACKS" | grep -qw "$stack"; then
              continue  # Skip platform stacks
            fi
            if [ ! -f "stacks/$stack/terragrunt.hcl" ]; then
              continue  # Skip non-terragrunt dirs
            fi
            echo "$stack" >> .app_apply
          done
        fi

        PLATFORM_COUNT=$(wc -l < .platform_apply | tr -d ' ')
        APP_COUNT=$(wc -l < .app_apply | tr -d ' ')
        echo "Platform stacks to apply: $PLATFORM_COUNT"
        echo "App stacks to apply: $APP_COUNT"
        cat .platform_apply .app_apply

      # ── Pre-warm provider cache ──
      - |
        if [ -s .platform_apply ] || [ -s .app_apply ]; then
          FIRST_STACK=$(cat .platform_apply .app_apply 2>/dev/null | head -1)
          if [ -n "$FIRST_STACK" ]; then
            echo "Pre-warming provider cache from stacks/$FIRST_STACK..."
            cd "stacks/$FIRST_STACK" && terragrunt init --terragrunt-non-interactive -input=false 2>&1 | tail -3 && cd ../..
          fi
        fi

      # ── Apply platform stacks (serial, with Vault advisory locks) ──
      - |
        FAILED_PLATFORM_STACKS=""
        if [ -s .platform_apply ]; then
          echo "=== Applying platform stacks (serial, locked) ==="
          while read -r stack; do
            # Tier-0 `vault` is human-applied via OIDC; the CI `ci` Vault role
            # lacks Vault-admin perms (sys/mounts + sys/policies/acl), so a CI
            # apply always 403s and fails the pipeline. Kept in PLATFORM_STACKS
            # (so the app-stack detector still excludes it) but skipped here.
            # (2026-06-27 — see docs/architecture/ci-cd.md)
            if [ "$stack" = "vault" ]; then echo "[vault] SKIPPED (Tier-0, human-applied via OIDC)"; continue; fi
            echo "[$stack] Starting apply..."
            ATTEMPT=0
            while :; do
              ATTEMPT=$((ATTEMPT + 1))
              set +e
              OUTPUT=$(cd "stacks/$stack" && ../../scripts/tg apply --non-interactive 2>&1)
              EXIT=$?
              set -e
              if [ $EXIT -eq 0 ]; then
                echo "$OUTPUT" | tail -3; echo "[$stack] OK"; break
              fi
              # Lock contention → SKIP, not fail. Match BOTH the Tier-0 Vault lock
              # ("is locked by", from scripts/tg) AND the Tier-1 PG-backend lock
              # ("Error acquiring the state lock" / "already locked"). The PG case
              # was previously counted as a failure — the #1 source of false reds.
              if echo "$OUTPUT" | grep -qE 'is locked by|Error acquiring the state lock|already locked'; then
                echo "[$stack] SKIPPED (locked by another session/run)"; break
              fi
              # Transient: provider-registry download timeout / Vault 5xx → bounded
              # retry. Deliberately NOT helm atomic-timeouts or config errors
              # (missing arg, invalid index) — those must fail fast, retry can't fix
              # them and can worsen a stuck helm release.
              if [ $ATTEMPT -lt 3 ] && echo "$OUTPUT" | grep -qE 'Failed to install provider|Client\.Timeout exceeded while awaiting headers|error reading from Vault.*Code: 5[0-9][0-9]'; then
                echo "[$stack] transient error (attempt $ATTEMPT/3) — retrying in 15s..."; sleep 15; continue
              fi
              echo "$OUTPUT" | tail -50; echo "[$stack] FAILED (exit $EXIT)"
              FAILED_PLATFORM_STACKS="$FAILED_PLATFORM_STACKS $stack"; break
            done
          done < .platform_apply
        fi
        # Deferred until after app stacks so both lists get a chance to run.
        echo "$FAILED_PLATFORM_STACKS" > .platform_failed

      # ── Apply app stacks (serial, with Vault advisory locks) ──
      - |
        FAILED_APP_STACKS=""
        if [ -s .app_apply ]; then
          echo "=== Applying app stacks (serial, locked) ==="
          while read -r stack; do
            echo "[$stack] Starting apply..."
            ATTEMPT=0
            while :; do
              ATTEMPT=$((ATTEMPT + 1))
              set +e
              OUTPUT=$(cd "stacks/$stack" && ../../scripts/tg apply --non-interactive 2>&1)
              EXIT=$?
              set -e
              if [ $EXIT -eq 0 ]; then
                echo "$OUTPUT" | tail -3; echo "[$stack] OK"; break
              fi
              # Lock contention → SKIP, not fail (Tier-0 Vault + Tier-1 PG; see platform loop).
              if echo "$OUTPUT" | grep -qE 'is locked by|Error acquiring the state lock|already locked'; then
                echo "[$stack] SKIPPED (locked by another session/run)"; break
              fi
              # Transient provider-download / Vault 5xx → bounded retry (see platform loop).
              if [ $ATTEMPT -lt 3 ] && echo "$OUTPUT" | grep -qE 'Failed to install provider|Client\.Timeout exceeded while awaiting headers|error reading from Vault.*Code: 5[0-9][0-9]'; then
                echo "[$stack] transient error (attempt $ATTEMPT/3) — retrying in 15s..."; sleep 15; continue
              fi
              echo "$OUTPUT" | tail -50; echo "[$stack] FAILED (exit $EXIT)"
              FAILED_APP_STACKS="$FAILED_APP_STACKS $stack"; break
            done
          done < .app_apply
        fi
        # Fail the step loudly so the pipeline `default` workflow state
        # reflects reality — the service-upgrade agent and CI alert cascade
        # both rely on this (see bd code-e1x). Lock-skipped stacks are NOT
        # counted as failures.
        FAILED_PLATFORM=$(cat .platform_failed 2>/dev/null | tr -d ' ')
        if [ -n "$FAILED_PLATFORM" ] || [ -n "$FAILED_APP_STACKS" ]; then
          echo "=== FAILED STACKS: platform=[$FAILED_PLATFORM ] apps=[$FAILED_APP_STACKS ] ==="
          exit 1
        fi

      # ── Commit and push state changes ──
      - |
        mkdir -p ~/.ssh && ssh-keyscan -H github.com >> ~/.ssh/known_hosts 2>/dev/null
        chmod 400 secrets/deploy_key
        git add stacks/ state/ .woodpecker/ 2>/dev/null || true
        git remote set-url origin git@github.com:ViktorBarzin/infra.git
        git diff --cached --quiet && echo "No changes to commit" && exit 0
        git commit -m "Woodpecker CI deploy [CI SKIP]"
        GIT_SSH_COMMAND='ssh -i ./secrets/deploy_key -o IdentitiesOnly=yes' git fetch origin master
        if ! GIT_SSH_COMMAND='ssh -i ./secrets/deploy_key -o IdentitiesOnly=yes' git rebase origin/master; then
          echo "ERROR: Git rebase failed — state commits could not be pushed"
          echo "Manual intervention required: pull, resolve conflicts, push"
          GIT_SSH_COMMAND='ssh -i ./secrets/deploy_key -o IdentitiesOnly=yes' git rebase --abort || true
          exit 1
        fi
        GIT_SSH_COMMAND='ssh -i ./secrets/deploy_key -o IdentitiesOnly=yes' git push origin master

      # ── Slack notification ──
      - |
        PLATFORM_COUNT=$(wc -l < .platform_apply 2>/dev/null | tr -d ' ')
        APP_COUNT=$(wc -l < .app_apply 2>/dev/null | tr -d ' ')
        curl -s -X POST -H 'Content-type: application/json' \
          --data "{\"channel\":\"general\",\"text\":\"Woodpecker CI: infra pipeline ${CI_PIPELINE_STATUS} (platform:${PLATFORM_COUNT}, apps:${APP_COUNT})\"}" \
          "$SLACK_WEBHOOK" || true

  # Slack on failure (runs even if apply step fails)
  - name: notify-failure
    image: curlimages/curl
    commands:
      - |
        curl -s -X POST -H 'Content-type: application/json' \
          --data "{\"channel\":\"general\",\"text\":\":red_circle: Woodpecker CI: infra pipeline FAILED\"}" \
          "$SLACK_WEBHOOK" || true
    environment:
      SLACK_WEBHOOK:
        from_secret: slack_webhook
    when:
      status: [failure]