infra/.claude/skills/setup-project/scripts/stability-gate.sh
Viktor Barzin 5e9e487661 feat(setup-project): auto-PR working Dockerfiles back to upstream
## Context
The setup-project skill treats "build from a Dockerfile" as priority 6 — "last
resort, avoid if possible" — with no formalized path for apps whose upstream
lacks a working Dockerfile. When we end up writing one to get the deploy green,
that Dockerfile stays private in the infra repo and upstream never benefits.

## This change
Adds a closed-loop flow: when we author a new Dockerfile (or fix a broken
upstream one) and the deploy is healthy for 10 minutes, auto-open a PR against
the upstream repo so the self-hosting community gets the working recipe.

Flow:
1. Classify dockerfile_state during research phase (image-used / used-as-is /
   fixed-broken-upstream / written-from-scratch). Persist to
   modules/kubernetes/<service>/.contribution-state.json.
2. After Terraform apply, run scripts/stability-gate.sh — polls pod Ready +
   HTTP 200 every 30s x 20 iterations, requires 18/20 successes.
3. On pass with a trigger state, scripts/contribute-dockerfile.sh does the
   GitHub API dance: fork → merge-upstream → branch → commit Dockerfile /
   .dockerignore / BUILD.md via Contents API → open PR with body rendered from
   templates/PR_BODY.md. Idempotent (skips on recorded PR URL, existing fork,
   existing branch, open PR, upstream landed a Dockerfile mid-deploy).

GitHub API via curl (gh CLI is sandbox-blocked per .claude/CLAUDE.md); token
pulled from Vault (`secret/viktor` → `github_pat`). Commits include
Signed-off-by for DCO-enforcing repos. Fork branch name is `add-dockerfile`
for written-from-scratch or `fix-dockerfile` for fixed-broken-upstream, with
timestamp suffix on collision.

## Files
- SKILL.md — state classification table, quality bar checklist, §8b stability
  gate, §10 contribute-upstream step, checklist updates
- scripts/stability-gate.sh — 10-minute health probe
- scripts/contribute-dockerfile.sh — GitHub API orchestrator
- templates/PR_BODY.md — `{{VAR}}` placeholder template for PR description
- templates/Dockerfile.README.md — BUILD.md template shipped with the PR

## What is NOT in this change
- No Woodpecker / GHA changes (skill-local flow).
- No auto-tracking of merge/reject outcomes upstream (manual follow-up).
- Not yet exercised end-to-end; first real-world run will validate the API
  dance. Plan to dry-run against a throwaway sink repo before pointing at a
  real upstream.

## Test Plan
### Automated
- bash -n on both scripts → pass
- Manual read-through of SKILL.md — step numbering coherent, existing
  §1-9 untouched semantics, new §8b/§10 reference real files

### Manual Verification
1. Next time setup-project onboards a Dockerfile-less app:
   - Confirm .contribution-state.json is written with `written-from-scratch`
   - Run stability-gate.sh — expect 18/20 passes on a healthy deploy
   - Run contribute-dockerfile.sh — expect a fork + branch + PR on ViktorBarzin
   - Verify contribution_pr_url is back-written to the state file
2. Re-run contribute-dockerfile.sh → must be a no-op (idempotent)
3. Upstream-archived case: manually archive a test upstream → re-run →
   expect SKIP, no PR created

[ci skip]

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 18:12:13 +00:00

71 lines
1.7 KiB
Bash
Executable file

#!/usr/bin/env bash
# 10-minute deploy stability gate for setup-project skill.
# Polls pod readiness + HTTP 200 on target URL every 30s for 20 iterations.
# Requires 18/20 probes to succeed (tolerates 2 blips for restarts/DNS propagation).
#
# Usage:
# stability-gate.sh <namespace> <app-label> <url>
#
# Example:
# stability-gate.sh myapp myapp https://myapp.viktorbarzin.me
#
# Exit codes:
# 0 - Stable (>=18/20 probes OK)
# 1 - Unstable (<18/20 probes OK)
# 2 - Usage error
set -u
if [ "$#" -ne 3 ]; then
echo "Usage: $0 <namespace> <app-label> <url>" >&2
exit 2
fi
NS="$1"
APP="$2"
URL="$3"
TOTAL_PROBES=20
MIN_SUCCESSES=18
INTERVAL_SECONDS=30
ok_count=0
fail_count=0
echo "stability-gate: ns=$NS app=$APP url=$URL"
echo "stability-gate: $TOTAL_PROBES probes x ${INTERVAL_SECONDS}s (need $MIN_SUCCESSES/$TOTAL_PROBES)"
for i in $(seq 1 "$TOTAL_PROBES"); do
probe_ok=true
if ! kubectl wait --for=condition=Ready pod -l "app=$APP" -n "$NS" --timeout=25s >/dev/null 2>&1; then
probe_ok=false
fi
status=$(curl -sS -o /dev/null -w "%{http_code}" --max-time 10 "$URL" || echo "000")
if [ "$status" != "200" ]; then
probe_ok=false
fi
if [ "$probe_ok" = "true" ]; then
ok_count=$((ok_count + 1))
printf " probe %2d/%d: OK (http=%s)\n" "$i" "$TOTAL_PROBES" "$status"
else
fail_count=$((fail_count + 1))
printf " probe %2d/%d: FAIL (http=%s)\n" "$i" "$TOTAL_PROBES" "$status"
fi
if [ "$i" -lt "$TOTAL_PROBES" ]; then
sleep "$INTERVAL_SECONDS"
fi
done
echo "stability-gate: results ok=$ok_count fail=$fail_count"
if [ "$ok_count" -ge "$MIN_SUCCESSES" ]; then
echo "stability-gate: PASS"
exit 0
fi
echo "stability-gate: FAIL (need $MIN_SUCCESSES, got $ok_count)" >&2
exit 1